,doc_body,doc_description,doc_full_name,doc_status,article_id
3,"DEMO: DETECT MALFUNCTIONING IOT SENSORS WITH STREAMING ANALYTICS
IBM Analytics
Published on Nov 6, 2017
This video demonstrates a Streaming Analytics application written in Python running in the IBM Data Science Experience. The results of the analysis are displayed on a map using Plotly. The notebook demonstrated in this video is available for you to try: http://ibm.biz/WeatherNotebook
Visit Streamsdev for more articles and tips about Streams: https://developer.ibm.com/streamsdev
Python API Developer guide: http://ibmstreams.github.io/streamsx....
Streaming Analytics in Python course: https://developer.ibm.com/courses/all...
",Detect bad readings in real time using Python and Streaming Analytics.,Detect Malfunctioning IoT Sensors with Streaming Analytics,Live,0
5,"COMMUNICATING DATA SCIENCE: A GUIDE TO PRESENTING YOUR WORK
Megan Risdal | 06.29.2016

See the forest, see the trees. Here lies the challenge in both performing and presenting an analysis. As data scientists, analysts, and machine learning engineers faced with fulfilling business objectives, we find ourselves bridging the gap between The Two Cultures: sciences and humanities. After we spend countless hours at the terminal devising a creative and elegant solution to a difficult problem, the insights and business applications are obvious in our minds. But how do you distill them into something you can communicate?

Qualifications and requirements for a senior data scientist position.

Presenting my work is one of the surprising challenges I faced in my recent transition from academia to life as a data analyst at a market research and strategy firm. When I was a linguistics PhD student at UCLA studying learnability theory in a classroom or measuring effects of an oral constriction on glottal vibration in a sound booth, my colleagues and I were comfortable speaking the same language. Now that I work with a much more diverse crowd of co-workers and clients with varied backgrounds and types of expertise, I need to work harder to ensure that the insights of my analyses are communicated effectively. In this second entry in the communicating data science series, I cover some essentials when it comes to presenting a thorough, comprehensible analysis for readers who want (or need) to know how to get their work noticed and read.

--------------------------------------------------------------------------------

GET YOUR HEAD IN THE GAME
Imagine you've just completed the so-called heavy lifting, whatever it may be, and you're ready to present your results and conclusions in a report. Well, step away from the word processor! There are two things you must first consider: your audience and your goals. This is your forest.

WHO IS YOUR AUDIENCE?
The matter of who you're speaking to will influence every detail of how you choose to present your analysis, from whether you use technical jargon to how much time you spend carefully defining your terms. The formality of the context may determine whether a short, fun tangent or personal anecdote will keep your audience happily engaged or elicit eye rolls … and worse. This is all important to consider because once you've envisioned your audience, you can take stock of what may and may not be shared knowledge and of how to manage their expectations. In your writing (and in everyday life), it's useful to be cognizant of Grice's principles of cooperative communication:
1. Maxim of quantity: be informative, without giving overwhelming amounts of extraneous detail.
2. Maxim of quality: be truthful. Enough said.
3. Maxim of relation: be relevant. I'll give you some tips on staying topical shortly!
4. Maxim of manner: be clear. Don't be ambiguous; be orderly.
So be cooperative! Know your audience and do what you can to anticipate their expectations. This will ensure that you cover all the ground in exactly as much detail as necessary in your report.

WHAT IS THE GOAL?
Also, before you put pen to paper, it's helpful to remind yourself again (and again) of what your goal is. If you're working in a professional environment, you're aware that it's important to be continually mindful of the goal or business problem and why you're tasked with solving it. Or perhaps it's a strategic initiative you're after: Did you set out to learn something new about some data (and the world)? Or have you been diligently working on a new skill you'd like to showcase? Do you want to test out some ideas and get feedback? It's okay to make it your goal to find out ""Can I do this?"" Maybe you want to share some of your expertise with the community on Kaggle Scripts. In that case, it's even more imperative that you have a buttoned-up analysis!

""If we can really understand the problem, the answer will come out of it, because the answer is not separate from the problem."" ― Jiddu Krishnamurti

If you've reached the point of having an analysis to report, you've more than likely familiarized yourself with the goals of the initiative, but you must also keep them at the forefront of your thoughts when presenting your results. Your work should be contextualized in terms of your understanding of the research objectives. Often in my own day job this means synthesizing many analyses I've performed into a few key pieces of evidence which support a story; this can't be done well, except by accident, without keeping in mind the ultimate objective at hand.

THE PREAMBLE
Now that you've got yourself in the right frame of mind―you can see the forest and you know the trees―you're ready to start thinking about the content of your report. However, before you start furiously spilling ink, first remind yourself of the three elements required to ask an askable question in science:
1. The question itself, along with some justification of how it addresses your objectives
2. A hypothesis
3. A feasible methodology for addressing your question
Much as I implore you to consider who your audience is and what your objectives are in order to get your mind in the right place, I'm recommending that you have the answers to these three things ready because they will dictate the content of your report. You don't want to throw everything and the kitchen sink into a report!

WHAT'S THE QUESTION?
On Kaggle, the competition hosts very generously provide their burning questions to the community. Outside of this environment, the challenge is to come up with one on your own or to work within the business objectives of your employer. At this point, make sure that you can appropriately state the question and how it relates to your objective(s). As an aside, if you need some exercise in the area of asking insightful questions (a skill unto its own), I hereby challenge you to scroll through some of Kagglers' most recent scripts, find and read one, and think of one new question you could ask the author. If you find that this is a stumbling block preventing you from proceeding with your analysis, many dataset publishers include a number of questions they'd like to see addressed. Or read the Script of the Week blogs and see what other ideas script authors would like to see explored in the same dataset.

WHAT'S THE ANSWER?
Now that you have your question, what do you think the answer will be? It's good practice, of course, to consider what the possible answers may be before you dig into the data, so hopefully you've already done that! Clearly delimiting the hypothesis space at this point will guide the evidence and arguments you use in the body of your report. It will be easier to evaluate what constitutes weak and strong support of your theory and what analyses may be absolutely irrelevant. Ultimately you will prevent yourself from attacking straw men in faux support of your theory.

Don't build straw men.

WHAT'S YOUR METHODOLOGY?
Let's say you're asking whether Twitter users with dense social networks in the How ISIS Uses Twitter dataset express greater negative sentiment than users with less dense networks. Your first step is to confirm that the data available is sufficient to address your research question. If major information is missing, you may want to rethink your question, revise your methodology, or even collect new data. If you're unsure of how to put language to a particular methodology, this is a good opportunity to flex your Googling skills. Search for “social network analysis in r” or “sentiment analysis in python.” Dive into some academic papers if it's appropriate and see how it's presented. Peruse the natural language processing tags on No Free Hunch and read the winners' interviews. Get inspiration from scripts on similar datasets on Kaggle. For example, a similar analysis was performed by Kaggle user Khomutov Nikita using the Hillary Clinton's Emails dataset.

Hillary Clinton's network graph. See the code here.

Even if you don't end up needing to share every nuance of your methodology with your given audience, you should always document your work thoroughly to the extent possible. Once you're ready to present your analysis, you'll be capable of determining how much is the right amount to share when discussing the nitty-gritty mechanics of your model. Similarly, I've been able to pleasantly surprise my boss many times because I have an answer ready at hand for immediate questions, thanks to keeping my exploratory analyses well documented. By the way, if you've felt overwhelmed by the task of putting together a solid methodology for tackling a question, it can't hurt to lob an idea and some code to the community for feedback. Especially once you have solid analysis-presentation skills! Be honest about where you feel you could use extra input and maybe a fellow Kaggler will come forth with a different angle on the problem.
PUTTING THE PIECES TOGETHER
Finally, you're ready to write. Keep in mind that a good analysis should facilitate its own interpretation as much as possible. Again, this requires anticipating what information your likely audience will be seeking and what knowledge they're coming in with already. One method which is both tried-and-true and friendly to the academic nature of the discipline is following a template for your analysis. With that, this section covers the structure which, when fleshed out, will help you tell the story in the data.

NOT SO ABSTRACT
Make it easy for your audience to quickly determine what they're about to digest. Use an abstract or introduction to recall your objectives and clearly state them for your readers. What is the problem that you've set out to solve? If you have a desired outcome or any expectations of your audience, say so, as this is the entire reason you're presenting them with your analysis. You then cover everything from your preamble in this section: the question you've been on a mission to answer, your hypothesis, and the methodology you've used. Finally, you will often provide a high-level summary of your results and key findings. Don't worry about spoiler alerts or boring your readers to death with the content that's about to follow. Trust that if they pay attention past the introduction, they are interested in how you achieve what you claim you have. Many people I've talked to have said that they often find it easier to write the abstract after having already completely documented the detailed findings of the analysis. I think that this is at least in part because, by doing so, you've familiarized yourself with your own work through the lens of your readership. Slowly but surely you're extracting yourself from the trees and the bigger picture becomes apparent.

THE CONTENT: BREAK OFF WHAT YOU CAN CHEW
This is where the good stuff lives. You've laid the foundation for your analysis such that your audience is prepared to read or listen intently to your story. I can't tell you the specifics of what goes here, but I can tell you how to structure it. Take your analysis in small bits by breaking your question into subparts. For a data-driven analysis, it can make sense to tackle each piece of evidence one by one. You may have a dissertation's worth of data to report on, but more likely than not you must pick and choose what will best support your analysis succinctly and effectively. Again, having the objectives and audience in mind will help you decide what's critical. Lay it all out before you and pair sub-questions with evidence until you have a story. Once you've presented the evidence, explain why it supports (or doesn't support) your hypothesis or your objectives. A good analysis also considers alternative hypotheses or interpretations. You've already surveyed the hypothesis space, so you should be ready-armed to handle contrary evidence. Doing so is also a way of anticipating the expectations of your audience and the skepticism they may harbor. It's at this point that it's most critical to keep in mind your objectives and the question you're addressing with your analysis. Ask how every piece of evidence you offer takes you one step closer to confirming or disproving your hypothesis.

OTHER TIPS AND TRICKS
Visualize the problem. Seeing is believing.
It sounds clichéd, as any statement asserting the value of data visualization does, but it's so incredibly true. This “trick” is so effective that I'm going to spend more time talking about it in a future post. If you can plainly “state” something with a graph or chart, go for it!
* Shail Jayesh Deliwala visualizes confusion matrices to evaluate and compare model performance. Read the full notebook here.
* Lj Miranda shows the steady rise of carbon emissions in the Philippines. Read the full notebook here.
* 33Vito uses polar coordinates to show the times during the day leveling and non-leveling characters play World of Warcraft. Read the full notebook here.
* Michael Griffiths uses color and variations in transparency to make this table of percentages more readily interpretable. Read the full notebook here.

Variety is the spice of life. And it can liven up your writing (and speaking) as well. For example, use a mix of short and sweet sentences interspersed among longer, more elaborate ones. Find where you accidentally used the word “didactic” four times on one page and change it up! Related to my first point, use effective variety in the types of visualizations you employ. Small things like this will keep your readers awake and interested.

Check your work. I don't like to emphasize this too much because I'm a descriptivist, but make sure your writing is grammatical, fluent, and free of typos. For better or worse, trivial mistakes can discredit you in the eyes of many. I find that it helps to read my writing aloud to catch disfluencies.

Gain muscle memory. If you really struggle with transforming your analysis into a form that can be shared more broadly, begin by writing anything until writing prose feels as natural as writing code. For example, I actually suggest sitting down and copying a report word for word. Or even any instance of persuasive writing. Not to be used as your own in any way (i.e., plagiarism), but to remove one more unknown from the equation: what it literally feels like to go through the motions of stringing words and sentences and paragraphs together to tell a story.

CONCLUSIONS & NEXT STEPS
A good analysis is repetitive. You know the intricacies of your work in and out, but your audience does not. You've told your readers in your abstract (or introduction, if you prefer) what you had ventured to do and even what you ended up finding, and the content lays this all out for them. In the conclusions section you hit them with it again. At this point, they've seen the relevant data you've carefully chosen to support your theory, so it's time to formally draw your conclusions. Your readers can decide if they agree or not. Speaking of being repetitive, after making your conclusions, you again remind your readers of the objective(s) of this report. Restate them and help your readers help you―what do you expect now? What feedback would you like?
What decision-making can happen now that your report is presented and the insights have been shared? In my work, I often collaborate with strategists to develop a set of recommendations for our clients. Typically I'll take a stab at it based on the expertise I've gained in working with the data, and a strategist will refine it using their business insights.

FIN
And this is exactly where the beauty of the analysis and your skillful presentation thereof meet. Because you've managed to package your approach in a fashion digestible to your audience, your readers, collaborators, and clients have comprehended and learned from your analysis and what its implications are without getting lost in the trees. They are equipped to react to the value in your work and participate in the next step of realizing its objectives.

--------------------------------------------------------------------------------

Thanks for reading the second entry in this series on communicating data science. I covered the basics of presenting an analysis at a very high level. I'd love to learn what your approach is, how you realize the value in your work, and how you collaborate with others to achieve business goals. Leave a comment or send me a note! If you missed my interview with Tyler Byers, a data scientist and storytelling expert, check it out here. Stay tuned to learn some data visualization fundamentals.

COMMENTS
* Liling Tan: Gricean maxims should be ""maxims"" of quantity, quality, relation and manner, not ""maximums"" =)
* Megan Risdal: Haha, wow! I don't know how I did that. Fixed. Thank you! 🙂
* Albert Camps: Very interesting, thanks!!! Trying to summarize it ended up being quite long anyway. A lot of distilled information. We work a bit differently. We include an executive summary + recommendations at the beginning of the presentation instead of putting them at the end, just after stating the question to answer. After that the audience knows what will come, and when the presentation is revisited it is a lot faster to check. If there's a need to dig deeper, all the analysis steps are still available. Hoping to see the next one soon! 😀
* Megan Risdal: Thanks! I actually often do the same thing re: executive summaries in my day job, too! That's a really good point. There's definitely no one-size-fits-all approach, which makes a high-level summarization misleading in certain ways. And now that I think of it, another strength in communicating data science is being able to be information dense & concise for times where you need to fit your work into a standalone one- or two-sheeter/executive summary. Hopefully more good stuff coming soon. 🙂
","See the forest, see the trees. Here lies the challenge in both performing and presenting an analysis. As data scientists, analysts, and machine learning engineers faced with fulfilling business obj…",Communicating data science: A guide to presenting your work,Live,1
7,"THIS WEEK IN DATA SCIENCE (APRIL 18, 2017)
Posted on April 18, 2017 by Janice Darling

Here's this week's news in Data Science and Big Data. Don't forget to subscribe if you find this useful!

INTERESTING DATA SCIENCE ARTICLES AND NEWS
* Top mistakes data scientists make when dealing with business people – A discussion of the three top mistakes data scientists make.
* 4 Trends in Artificial Intelligence that affect enterprises – Four AI trends that stand out in their effect on companies and enterprises.
* R Best Practices: R you writing the R way! – A list of programming practices that result in improved readability, consistency, and repeatability.
* The 5 Best Reasons To Choose MYSQL – and its 5 Biggest Challenges – Reasons to use MySQL and the common challenges associated with it.
* 7 types of job profiles that make you a Data Scientist – A discussion of the common skill sets of different data scientist profiles.
* Detecting Hackers & Impersonators with Machine Learning – Applying Machine Learning to detect phishing attacks faster.
* Some Lesser-Known Deep Learning Libraries – A list of lesser-known but useful Deep Learning libraries.
* In case you missed it: March 2017 roundup – Articles about R programming from Revolutions.
* Investing, Fast & Slow – Part 2: Investment for Data Scientists 101 – The second part in a discussion series on investing and data science from Dataconomy.
* 10 Free Must-Read Books for Machine Learning and Data Science – A list of interesting Machine Learning and Data Science reads.
* Integrate SparkR And R For Better Data Science Workflow – How to work with R and SparkR for wrangling large datasets.
* Can Watson, the Jeopardy champion, solve Parkinson's? – Toronto researchers are using Watson to help find a cure for Parkinson's.
* The Henry Ford to debut 'cognitive dress' using IBM Watson technology – The Henry Ford will display a dress created from a collaboration between Marchesa and IBM Watson.
* The Democratization of Machine Learning: What It Means for Tech Innovation – How accessible ML can further spur tech innovation.
* 3 reasons why data scientist remains the top job in America – A discussion of why the role of data scientist has remained the top job in America.

FEATURED COURSES FROM BDU
* SQL and Relational Databases 101 – Learn the basics of the database querying language, SQL.
* Big Data 101 – What Is Big Data? Take Our Free Big Data Course to Find Out.
* Predictive Modeling Fundamentals I – Take this free course and learn the different mathematical algorithms used to detect patterns hidden in data.
* Using R with Databases – Learn how to unleash the power of R when working with relational databases in our newest free course.
* Deep Learning with TensorFlow – Take this free TensorFlow course and learn how to use Google's library to apply deep learning to different data types in order to solve real world problems.

UPCOMING DATA SCIENCE EVENTS
* Data Science: Classification Algorithms in Python (Hands-On) – April 25, 2017 @ 6 – 8:30 pm, Lighthouse Labs

COOL DATA SCIENCE VIDEOS
* Machine Learning With Python – Unsupervised Learning – Measuring the Distances Between Clusters – Using Single Linkage Clustering to measure the distance between clusters.
* Machine Learning With Python – Hierarchical Clustering Advantages & Disadvantages – A discussion of Hierarchical Clustering.
* Machine Learning With Python – Unsupervised Learning K Means Clustering Advantages & Disadvantages – A discussion of K-Means Clustering.
",Here's this week's news in Data Science and Big Data.,"This Week in Data Science (April 18, 2017)",Live,2
8,"DATALAYER: HIGH THROUGHPUT, LOW LATENCY AT SCALE - BOOST THE PERFORMANCE OF YOUR DISTRIBUTED DATABASE
Published Dec 29, 2016

Learn how distributed DBs (Cassandra, MongoDB, RethinkDB, etc.) solve the problem of scaling persistent storage but introduce latency as data size increases and they become I/O bound. In single-server DBs, latency is solved by introducing caching.
In this talk, Akbar Ahmed shows you how to improve the performance of distributed DBs by using a distributed cache to move the data layer performance limitation from I/O bound to network bound.

Akbar is the CEO and founder of DynomiteDB, a framework for turning single-server data stores into linearly scalable, distributed databases. He is an Apache Cassandra certified developer and a Cassandra MVP; he enjoys the expressiveness of both SQL and alternative query languages, evaluates the entire database ecosystem every 6 months, and has an MBA in Information Systems.
",Learn how distributed DBs solve the problem of scaling persistent storage, but introduce latency as data size increases and become I/O bound.,DataLayer Conference: Boost the performance of your distributed database,Live,3
12,"DATA SCIENCE EXPERIENCE: ANALYZE NY RESTAURANT INSPECTIONS DATA
developerWorks TV
Published on Oct 3, 2017
Find more videos in the Data Science Experience Learning Center at http://ibm.biz/dsx-learning
",This video demonstrates the power of IBM Data Science Experience using a simple New York State Restaurant Inspections data scenario. ,Analyze NY Restaurant data using Spark in DSX,Live,4
It's a complement to the extensive and complex client tools that exist for PostgreSQL which, once configured and mastered, can let you peer into every corner of the database.We'll be expanding the Compose browser's capabilities over time with that philosophy in mind, but first, let's take a look at the capabilities already available in the PostgreSQL browser.We'll begin our tour of the PostgreSQL browser from the top. To get to the browser, log into the Compose dashboard, select 'Browser' in the sidebar to see this...The ""top"" of the browser's view shows the databases created in the PostgreSQL instance. In the screenshot above you can see that there are two, the default compose and a dvdrental database. You can also see the on-disk size of each of the databases. The dvdrental is having data imported into it for a future demonstration so let's take a look inside there by clicking on its row to reveal:A number of tables. These are the tables in the database, each one displayed with an estimated row count. If selected, the Admin tab in the sidebar only offers the option to delete the current database, but we're more interested in looking at some of the data here. If we click into the film table, we get a better view of that data:This is the Query view of the film table. The default query reads the first 20 items from the table and displays their contents in a table below the query. This table will include all the fields so some horizontal scrolling could be involved. You can edit the query to adjust and LIMIT the number of items displayed, add a WHERE clause to include your own selection criteria, add an ORDER BY clause and sort according to a field or add an OFFSET to skip a number of returned results. Using all of these would look like this.That OFFSET value can also be changed, by the LIMIT value, using the Next and Last buttons at the bottom of the table view so you can page through the data.If there's a primary key on the table, then you'll also be able to get a better look at the data in a row by clicking anywhere on a row to get to the update row view. We're going to give you the whole view of one here, though if you have a field-rich table, expect to scroll:Here we see nearly all the fields of the row; the only one missing is the primary key field, which if you look just above the field list is being used in the WHERE clause to select this record. The rest of the fields are displayed with both the field name and, usually, the type of that field along with an editable field to allow for modifications.So, we can see, going down this page, a ""title"" field, defined as a ""varchar(255)"" with a text area to edit its contents, and below that, a ""description"" field, defined as ""text"" also with a text area. Field validation takes place on submitting the update to the database, so if you put too much text in the ""title"" field, it's at update time that you'll be told there's too much text.The same goes for validating the numeric fields, like the smallint and numeric types, and the date intervals like the year field. The observant will notice a field with the type ""mpaa_rating"" further down the table. 
That OFFSET value can also be changed, in steps of the LIMIT value, using the Next and Last buttons at the bottom of the table view so you can page through the data.

If there's a primary key on the table, then you'll also be able to get a better look at the data in a row by clicking anywhere on a row to get to the update row view. We're going to give you the whole view of one here, though if you have a field-rich table, expect to scroll. Here we see nearly all the fields of the row; the only one missing is the primary key field, which, if you look just above the field list, is being used in the WHERE clause to select this record. The rest of the fields are displayed with both the field name and, usually, the type of that field, along with an editable field to allow for modifications.

So we can see, going down this page, a ""title"" field, defined as a ""varchar(255)"" with a text area to edit its contents, and below that, a ""description"" field, defined as ""text"", also with a text area. Field validation takes place on submitting the update to the database, so if you put too much text in the ""title"" field, it's at update time that you'll be told there's too much text. The same goes for validating the numeric fields, like the smallint and numeric types, and the date intervals like the year field.

The observant will notice a field with the type ""mpaa_rating"" further down the table. This field has its type set to a user-defined enum type like this:

CREATE TYPE mpaa_rating AS ENUM ('G', 'PG', 'PG-13', 'R', 'NC-17');

Because the browser lets the database flag errors at update time, this field is also validated; enter a string value which doesn't match one of the enum values and, when you press update, you'll get an error and the update will be completely rolled back.

Back to the field types in the table. The ""lastupdate"" field is effectively read-only as it'll be overwritten during updating. The ""fulltext"" field is a tsvector field and is also not editable - in this database's case it is updated on insert or update by a trigger. There is one field you can edit, the text array that is ""specialfeatures"", which takes PostgreSQL syntax for an array literal – { ""string"",""string"",""string"",... }.

That covers editing, but you can also add new rows. You'll find the button for that in the query view in the top right, marked ""Insert row"". It'll bring up an unpopulated page similar to the edit row page. This form is more forgiving of validation errors than the ""Update Row"" page in that, if you do have a change which is rejected by the validation process, the fields you have entered are not cleared. Apart from that, it is functionally the same as the ""Update Row"" page.

If we go back to the top table view, there are two tabs we haven't mentioned at the top of the page. The Indexes tab shows the current indexes that apply to the table we are looking at. Here, for example, we can see the unique primary key on the film_id, a fulltext index on the tsvector fulltext field, a foreign key index on the language id, and a simple index on the movie title. There's also the option to drop any one of these indexes with the right side's drop button. As well as displaying indexes, you can create indexes, albeit, currently, only unique or non-unique btree indexes. Enter the fields you want indexed between the parentheses, click Unique if you want a unique index, and click name and enter an index name if you want to set a particular name for the index - then just click Create Index.

The Settings tab currently offers one option: dropping the table. The choice here is whether or not to drop any database objects that depend on the table when you drop it, using a CASCADE operator. Remember to check you have a working backup of your table, or whole database, before you drop the table, as there's no going back on the drop without doing a restore. The browser doesn't have a ""Create Table"" option yet, so you can't manually rebuild the table without reaching for the psql command-line tool.

We'll be enhancing the PostgreSQL browser in the future. As you can see, there's already a useful range of functionality for the database user on the go, and we aim to make it your first stop for PostgreSQL control on Compose.
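As a rough guide to the plain SQL those last two tabs stand in for, the statements below are a sketch against the dvdrental sample rather than anything the browser literally displays, so treat the names as assumptions:

  -- what the Indexes tab's Create Index form amounts to
  -- (put UNIQUE before INDEX if the Unique box is ticked)
  CREATE INDEX idx_film_title ON film (title);

  -- what the Settings tab's drop does with the dependent-objects option ticked
  DROP TABLE film CASCADE;

Either way the browser is issuing ordinary DDL, which is why a recent backup is the only way back from that drop.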
",Using Compose's PostgreSQL data browser.,Browsing PostgreSQL Data with Compose,Live,5
15,"UPGRADING YOUR POSTGRESQL TO 9.5
Published Apr 26, 2016

Upgrading your PostgreSQL deployment to version 9.5 is now possible through the Compose console. Working out how to perform this upgrade safely and reliably has been an interesting process, because going from version 9.4 to 9.5 is a PostgreSQL major upgrade.

""Wait"", you say, ""that's not a major upgrade.""

With PostgreSQL it is... ""A major release is numbered by increasing either the first or second part of the version number, e.g. 9.1 to 9.2.""

The important thing about PostgreSQL major updates is that they usually change how the data is stored internally. That's why, whenever a PostgreSQL database starts up, it checks what version of PostgreSQL created the data directory. If it isn't the same major version, it'll refuse to run. Upgrading is traditionally done by dumping the contents of the databases, updating the database software and then restoring the dump's contents to the freshly updated database. It's a bit hands-on and time-consuming, so we looked for an alternative. We'll be looking at how we came up with our approach to this in another article coming soon. Suffice to say, we looked at engineering an upgrade system with an eye on resilience and redundancy which performed quickly.

COMPOSE'S POSTGRESQL UPGRADE
Our major version upgrade process begins with a backup. We start there as it's a known point in the life of your data. You may wish to put your applications into maintenance mode and create an on-demand backup to ensure that you have the most recent data. Whichever backup you go with, it will be restored to a new PostgreSQL deployment where we may, or may not, run the pg_upgrade tool.

We call this process Deployment from backup, and it supports the ability to change database version while it runs. At the end of the process, you'll have a freshly provisioned database in a fraction of the time it would take to dump and restore it. Backups are made automatically on Compose so there's always a recent backup, but you can always make one on demand to be completely up to date. Once you have a backup, you can restore it to a new deployment by clicking on the restore icon.

That will take you to the Deployment from backup dialog. At the top are details about the backup you have selected to restore: which deployment it is from and when it was created. The rest of the page is about the deployment to be created to house this restored backup. You can enter a new deployment name (or accept the delightfully generated default). You can then say generally where you'd like the deployment created if you have a Compose Enterprise account; otherwise it defaults to ""Compose Hosted"". If ""Compose Hosted"" is selected, a range of data center locations is then available to create the deployment in.

Then we get to the Upgrade section. By default, this process will not upgrade your database and will select the matching major version. If you click Create Deployment at this point you will effectively clone your database. If, on the other hand, you select a different version, like say 9.5, when you click Create Deployment, something extra happens. Your data is restored into a new deployment but, rather than start up a database instance, pg_upgrade is run to upgrade the stored data to the selected version.

You'll now have your original PostgreSQL deployment and an upgraded PostgreSQL deployment running concurrently. Validate the upgraded PostgreSQL deployment and switch your applications to use that, then decommission the original PostgreSQL when you are ready. If you're unhappy with the upgraded database, you still have your original database to fall back to. We're not expecting anyone to have problems with the Deployment from backup process, but we like to build things which give you lots of room to recover with if anything does go astray.
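One quick sanity check while the original and upgraded deployments are running side by side: this is not a step the article itself prescribes, just the standard PostgreSQL way to confirm which major version each one is on before you switch applications over:

  -- run against both deployments and compare
  SELECT version();      -- full version string
  SHOW server_version;   -- just the version number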
",Upgrading your PostgreSQL deployment to version 9.5 is now possible through the Compose console. Working out how to perform this upgrade safely and reliably has been an interesting process because from version 9.4 to 9.5 is a PostgreSQL major upgrade.,Upgrading your PostgreSQL to 9.5,Live,6
17,"DATA WRANGLING AT SLACK
By Ronnie Chen and Diana Pojar
Dec 7
Research Data Management via janneke staaks, licensed under Creative Commons

For a company like Slack that strives to be as data-driven as possible, understanding how our users use our product is essential. The Data Engineering team at Slack works to provide an ecosystem to help people in the company quickly and easily answer questions about usage, so they can make better, data-informed decisions: “Based on a team's activity within its first week, what is the probability that it will upgrade to a paid team?” or “What is the performance impact of the newest release of the desktop app?”

THE DREAM
We knew when we started building this system that we would need flexibility in choosing the tools to process and analyze our data. Sometimes the questions being asked involve a small amount of data and we want a fast, interactive way to explore the results. Other times we are running large aggregations across longer time series and we need a system that can handle the sheer quantity of data and help distribute the computation across a cluster. Each of our tools would be optimized for a specific use case, and they all needed to work together as an integrated system. We designed a system where all of our processing engines would have access to our data warehouse and be able to write back into it. Our plan seemed straightforward enough as long as we chose a shared data format, but as time went on we encountered more and more inconsistencies that challenged our assumptions.

THE SETUP
Our central data warehouse is hosted on Amazon S3, where data can be queried via three primary tools: Hive, Presto and Spark. To help us track all the metrics that we want, we collect data from our MySQL database, our servers, clients, and job queues and push them all to S3. We use an in-house tool called Sqooper to scrape our daily MySQL backups and export the tables to our data warehouse. All of our other data is sent to Kafka, a scalable, append-only message log, and then persisted on to S3 using a tool called Secor. For computation, we use Amazon's Elastic MapReduce (EMR) service to create ephemeral clusters that are preconfigured with all three of the services that we use.

Presto is a distributed SQL query engine optimized for interactive queries.
It's a fast way to answer ad-hoc questions, validate data assumptions, explore smaller datasets, create visualizations, and power some internal tools where we don't need very low latency. When dealing with larger datasets or longer time series data, we use Hive, because it implicitly converts SQL-like queries into MapReduce jobs. Hive can handle larger joins and is fault-tolerant to stage failures, and most of the jobs in our ETL pipelines are written this way. Spark is a data processing framework that allows us to write batch and aggregation jobs that are more efficient and robust, since we can use a more expressive language instead of SQL-like queries. Spark also allows us to cache data in memory to make computations more efficient. We write most of our Spark pipelines in Scala, and use them for data deduplication and for all core pipelines.

TYING IT ALL TOGETHER
How do we ensure that all of these tools can safely interact with each other? To bind all of these analytics engines together, we define our data using Thrift, which allows us to enforce a typed schema and have structured data. We store our files using Parquet, which formats and stores the data in a columnar format. All three of our processing engines support Parquet, and it provides many advantages around query and space efficiency. Since we process data in multiple places, we need to make sure that our systems are always aware of the latest schema, so we rely on the Hive Metastore to be our ground truth for our data and its schema.

CREATE TABLE IF NOT EXISTS server_logs (
  team_id BIGINT,
  user_id BIGINT,
  visitor_id STRING,
  user_agent MAP<STRING, STRING>,
  api_call_method STRING,
  api_call_ok BOOLEAN
)
PARTITIONED BY (year INT, month INT, day INT, hour INT)
STORED AS PARQUET
LOCATION 's3://data/server_logs'

Both Presto and Spark have Hive connectors that allow them to access the Hive Metastore to read tables, and our Spark pipelines dynamically add partitions and modify the schema as our data evolves. With a shared file format and a single source for table metadata, we should be able to pick any tool we want to read or write data from a common pool without any issues. In our dream, our data is well defined and structured, and we can evolve our schemas as our data needs evolve. Unfortunately, our reality was a lot more nuanced than that.
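For a sense of what that shared pool looks like in use, a query along the lines of the one below runs unchanged in Hive, Presto, or Spark SQL against the server_logs table defined above (give or take catalog and schema settings); the partition values and the aggregation are invented for illustration:

  -- the same statement works in any of the three engines,
  -- because they all resolve the table through the Hive Metastore
  SELECT api_call_method,
         count(*) AS calls,
         sum(CASE WHEN api_call_ok THEN 0 ELSE 1 END) AS failed_calls
  FROM server_logs
  WHERE year = 2016 AND month = 12 AND day = 7   -- prunes to a single day's partitions
  GROUP BY api_call_method;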
COMMUNICATION BREAKDOWN
All three processing engines that we use ship with libraries that enable them to read and write the Parquet format. Managing the interoperation of all three engines using a shared file format may sound relatively straightforward, but not everything handles Parquet the same way, and these tiny differences can make big trouble when trying to read your data. Under the hood, Hive, Spark, and Presto are actually using different versions of the Parquet library and patching different subsets of bugs, which does not necessarily keep backwards compatibility. One of our biggest struggles with EMR was that it shipped with a custom version of Hive that was forked from an older version and was missing important bug fixes. What this means in practice is that the data you write with one of the tools might not be readable by the other tools, or worse, you can write data which is read by another tool in the wrong way. Here are some sample issues that we encountered:

ABSENCE OF DATA
One of the biggest differences that we found between the different Parquet libraries was how each one handled the absence of data. In Hive 0.13, when you use Parquet, a null value in a field will throw a NullPointerException. But supporting optional fields is not the only issue. The way that data gets loaded can turn a block of nulls— harmless by themselves —into an error if no non-null values are also present (PARQUET-136). In Presto 0.147, it was complex structures that uncovered a different set of issues — we saw exceptions being thrown when the keys of a map or list are null. The issue was fixed in Hive, but not ported to the Presto dependency (HIVE-11625). To protect against these issues, we sanitize our data before writing to the Parquet files so that we can safely perform lookups.

SCHEMA EVOLUTION TROUBLES
Another major source of incompatibility is around schema and file format changes. The Parquet file format has a schema defined in each file based on the columns that are present. Each Hive table also has a schema, and each partition in that table has its own schema. In order for data to be read correctly, all three schemas need to be in agreement. This becomes an issue when we need to evolve custom data structures, because the old data files and partitions still have the original schema. Altering a data structure by adding or removing fields will cause old and new data partitions to have their columns appear at different offsets, resulting in an error being thrown. Doing a complete update would require re-serializing all of the old data files and updating all of the old partitions. To get around the time and computation costs of doing a complete rewrite for every schema update, we moved to a flattened data structure where new fields are appended to the end of the schema as individual columns.

These errors that will kill a running job are not as dangerous as invisible failures like data showing up in incorrect columns. By default, Presto settings use column location to access data in Parquet files, while Hive uses column names. This means that Hive supports the creation of tables where the Parquet file schema and the table schema columns are in a different order, but Presto will read those tables with the data appearing in different columns!

File schema:
  ""fields"": [{""name"":""user_id"",""type"":""long""},
             {""name"":""server_name"",""type"":""string""},
             {""name"":""experiment_name"", ""type"":""string""}]
Table schema:
  (user_id BIGINT, experiment_name STRING, server_name STRING)

----------------- Hive ------------------
user_id   experiment_name   server_name
1         test1             slack-1
2         test1             slack-2

---------------- Presto -----------------
user_id   experiment_name   server_name
1         slack-1           test1
2         slack-2           test1

It's a simple enough problem to avoid or fix with a configuration change, but easily something that can slip through undetected if not checked for.
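In Hive DDL, the append-only evolution described above comes down to only ever adding columns at the tail of the table schema, roughly as follows; the column name is invented for illustration, and exactly how already-written partitions behave depends on your Hive version and settings:

  -- append new fields at the end of the schema; never reorder or remove
  ALTER TABLE server_logs ADD COLUMNS (client_version STRING);
  -- older files simply lack the column, so reads of old data return NULL for it;
  -- newer Hive versions also accept ADD COLUMNS ... CASCADE to push the change
  -- into the metadata of existing partitions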
This leads to us being locked into certain versions until we implement workarounds for all of the compatibility issues, which makes cluster upgrades a very scary proposition. Even worse, when upgrades render our old workarounds unnecessary, we still have a difficult decision to make. For every workaround we remove, we have to decide whether it’s more effective to backfill our data to remove the hack or to perpetuate it to maintain backwards compatibility. How can we make that process easier?

A COMMON LANGUAGE

To solve some of these issues and to enable us to safely perform upgrades, we wrote our own Hive InputFormat and Parquet OutputFormat to pin our encoding and decoding of files to a specific version. By bringing control of our serialization and deserialization in house, we can safely use out-of-the-box clusters to run our tooling without worrying about being unable to read our own data. These formats are essentially forks of the official versions which bring in the bug fixes from various builds.

FINAL THOUGHTS

Because the various analytics engines we use have subtly different requirements about serialization and deserialization of values, the data that we write has to fit all of those requirements in order for us to read and process it. To preserve the ability to use all of those tools, we ended up limiting ourselves and building only for the shared subset of features. Shifting control of these libraries into a package that we own and maintain allows us to eliminate many of the read/write errors, but it’s still important to consider all of the common and uncommon ways that our files and schemas can evolve over time. Most of our biggest challenges on the data engineering team were not centered around writing code, but around understanding the discrepancies between the systems that we use. As you can see, those seemingly small differences can cause big headaches when it comes to interoperability. Our job on the data team is to build a deeper understanding of how our tools interact with each other, so we can better predict how to build for, test, and evolve our data pipelines.

--------------------------------------------------------------------------------

If you want to help us make Slack a little bit better every day, please check out our job openings page and apply. Thanks to Diana Pojar and Ross Harmes.

Big Data Analytics

RONNIE: I engineer data at @SlackHQ. Professional business dog.

SEVERAL PEOPLE ARE CODING: The Slack Engineering Blog","For a company like Slack that strives to be as data-driven as possible, understanding how our users use our product is essential. The Data Engineering team at Slack works to provide an ecosystem to…",Data Wrangling at Slack,Live,7 21,"
$1,000,000 • 655 TEAMS DATA SCIENCE BOWL 2017

CAN YOU IMPROVE LUNG CANCER DETECTION? In the United States, lung cancer strikes 225,000 people every year, and accounts for $12 billion in health care costs. Early detection is critical to give patients the best chance at recovery and survival. One year ago, the office of the U.S. Vice President spearheaded a bold new initiative, the Cancer Moonshot, to make a decade's worth of progress in cancer prevention, diagnosis, and treatment in just 5 years. In 2017, the Data Science Bowl will be a critical milestone in support of the Cancer Moonshot by convening the data science and medical communities to develop lung cancer detection algorithms. Using a data set of thousands of high-resolution lung scans provided by the National Cancer Institute, participants will develop algorithms that accurately determine when lesions in the lungs are cancerous. This will dramatically reduce the false positive rate that plagues the current detection technology, get patients earlier access to life-saving interventions, and give radiologists more time to spend with their patients. This year, the Data Science Bowl will award $1 million in prizes to those who observe the right patterns, ask the right questions, and in turn, create unprecedented impact around cancer screening care and prevention. The funds for the prize purse will be provided by the Laura and John Arnold Foundation. Visit DataScienceBowl.com to: • Sign up to receive news about the competition • Learn about the history of the Data Science Bowl and past competitions • Read our latest insights on emerging analytics techniques

ACKNOWLEDGMENTS The Data Science Bowl is presented by

COMPETITION SPONSORS Laura and John Arnold Foundation The Cancer Imaging Program of NCI American College of Radiology Amazon Web Services NVIDIA

DATA SUPPORT PROVIDERS National Lung Screening Trial The Cancer Imaging Archive Dr. Bram van Ginneken, Professor of Functional Image Analysis and his team at Radboud University Medical Center in Nijmegen Lahey Hospital & Medical Center University of Copenhagen Nicholas Petrick, Ph.D., Acting Director Division of Imaging, Diagnostics and Software Reliability Office of Science and Engineering Laboratories Center for Devices and Radiological Health U.S. Food and Drug Administration

SUPPORTING ORGANIZATIONS Bayes Impact Black Data Processng Associates Code the Change Data Community DC DataKind Galvanize Great Minds in STEM Hortonworks INFORMS Lesbians Who Tech NSBE Society of Asian Scientists & Engineers Society of Women Engineers University of Texas Austin, Business Analytics Program, McCombs School of Business US Dept.
of Health and Human Services US Food and Drug Administration Women in Technology Women of Cyberjutsu Started: 2:00 pm, Thursday 12 January 2017 UTC Ends: 11:59 pm, Wednesday 12 April 2017 UTC (90 total days) Points: this competition awards standard ranking points Tiers: this competition counts towards tiers","Kaggle is your home for data science. Learn new skills, build your career, collaborate with other data scientists, and compete in world-class machine learning challenges.",Data Science Bowl 2017,Live,8 28,"THE GRADIENT FLOW DATA / TECHNOLOGY / CULTURE

USING APACHE SPARK TO PREDICT ATTACK VECTORS AMONG BILLIONS OF USERS AND TRILLIONS OF EVENTS

[A version of this post appears on the O’Reilly Radar .] THE O’REILLY DATA SHOW PODCAST: FANG YU ON DATA SCIENCE IN SECURITY, UNSUPERVISED LEARNING, AND APACHE SPARK. Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science: Stitcher , TuneIn , iTunes , SoundCloud , RSS . In this episode of the O’Reilly Data Show, I spoke with Fang Yu , co-founder and CTO of DataVisor . We discussed her days as a researcher at Microsoft, the application of data science and distributed computing to security, and hiring and training data scientists and engineers for the security domain. DataVisor is a startup that uses data science and big data to detect fraud and malicious users across many different application domains in the U.S. and China. Founded by security researchers from Microsoft , the startup has developed large-scale unsupervised algorithms on top of Apache Spark, to (as Yu notes in our chat) “predict attack vectors early among billions of users and trillions of events.” Several years ago, I found myself immersed in the security space and at that time tools that employed machine learning and big data were still rare. More recently, with the rise of tools like Apache Spark and Apache Kafka, I’m starting to come across many more security professionals who incorporate large-scale machine learning and distributed systems into their software platforms and consulting practices. Below are some highlights from our conversation:

UNSUPERVISED LEARNING FOR DETECTING FRAUDULENT USERS AND BEHAVIOR

Let me step back a little bit and explain how traditional solutions identify bad accounts or bad behavior. Traditionally, the typical solution is rule-based. For example, a user may not be allowed to just register, and immediately start to transfer money or immediately starting send a lot of email. That behavior is bad, so you write a rule based on that. But a rule-based solution is very reactive. You need to observe what attackers are doing and then based on that, you derive expert rules. Rule-based systems are hard to maintain and are always late because a human needs to observe the bad behavior and start to write the rules. Nowadays, a rule-based system is one solution, but a lot of online services are moving to a machine learning-based solution. They have some bad labels and then they train a model. Discover unknown attacks without requiring labels or training data. Source: Fang Yu, used with permission. In DataVisor, we developed a brand new solution, which is unsupervised. We do not require clients to give us labeled data. In our approach, we do not only look at a single user’s behavior.
We put all the users together and study correlations between the users and how users link to each other, how similar are the users’ actions. Nowadays, bad attackers do not have a single bad account. They usually have tens of accounts, hundreds, even millions of accounts. Using these accounts, they can do spam, they can do “likes,” they do transactions. These accounts usually have high correlations among them because they’re controlled by robots or controlled by trained people. For us, we look at the user-user correlation.

AN ECOSYSTEM THAT SUPPORTS ATTACKS ACROSS DIFFERENT INDUSTRIAL SECTORS

Because we look at the account level and how users behave, our engine is quite general to different sectors. We have clients in social media, mobile gaming, and we’re also working with a client in financial services. The reason that our engine can work across different sectors is that we look at the notion of accounts and the underground ecosystem that supports massive attacks to different services [and which can] have the same set of people. Some people specialize in registering bad accounts, some people specialize in stealing credit cards, and some people specialize in writing templates, etc. So, there is an underground ecosystem in the tools they use, the data centers that they use, the VPNs they use. There are a lot of commonalities across different sectors.

APACHE SPARK

We have clients that send us billions of events per day, so it’s a huge amount of data, and you want to find a small amount of bad users. It’s like finding a needle in a haystack without any labels. It’s very challenging. There are also a lot of the social network elements associated with security. Some attackers want to actively friend because the more they friend, the more they can spam them, etc. The resulting graphs can be massive. One of our founding members also came from Berkeley and he used Spark before; when we wanted to scale the system, Spark was a very natural choice. We have had a very positive experience. Spark is very easy to use and it has a great community; it helped us scale our system pretty well.

Note: Fang Yu’s frequent collaborator and DataVisor co-founder Yinglian Xie will speak about Leveraging Apache Spark to analyze billions of user actions to reveal hidden fraudsters at Strata + Hadoop World in San Jose this March .

Related resources:
* Scalable Machine Learning (video)
* Secure Because Math? Challenges on Applying Machine Learning to Security (video)
* The Security Data Lake (free report)

02/25/2016 Ben Lorica | data show, podcast, security, spark
","[A version of this post appears on the O’Reilly Radar.] The O’Reilly Data Show podcast: Fang Yu on data science in security, unsupervised learning, and Apache Spark. Subscribe to the O’Reilly…",Using Apache Spark to predict attack vectors among billions of users and trillions of events,Live,9 30,"OFFLINE-FIRST IOS APPS WITH SWIFT & CLOUDANT SYNC; PART 1: THE DATASTORE

Jason H. Smith / January 25, 2016

This walk-through is a sequel to Apple’s well-known iOS programming introduction, Start Developing iOS Apps (Swift). Apple’s introduction walks us through the process of building the UI, data, and logic of an example food tracker app, culminating with a section on data persistence: storing the app data as files in the iOS device. This series picks up where that document leaves off: syncing data between devices, through the cloud, with an offline-first design. You will achieve this using open source tools and the free IBM Cloudant service. This document is the first in the series, showing you how to use the Cloudant Sync datastore, CDTDatastore, for FoodTracker on the iOS device. Subsequent posts will cover syncing to the cloud and other advanced features such as accounts and data management.

TABLE OF CONTENTS 1. Getting Started 2. CocoaPods 1. Learning Objectives 2. Install CocoaPods on your Mac 3. Install Cloudant Sync using CocoaPods 4. Change from a Project to a Workspace 3. Compile with Cloudant Sync 1. Learning Objectives 2. Create the CDTDatastore Bridging Header 3. Check the Build 4. Store Data Locally with Cloudant Sync 1. Offline First 2. Learning Objectives 3. The Cloudant Document Model 4. Design Plan 5. Remove NSCoding 6. Initialize the Cloudant Sync Datastore 7. Side Note: Deleting the Datastore in the iOS Simulator 8. Implement Storing and Querying Meals 9. Create Sample Meals in the Datastore 5. Conclusion 6.
Download This ProjectGETTING STARTEDThe FoodTracker main screenThese lessons assume that you have completed the FoodTracker app from Apple’s walk-through. First, complete that walk-through. It will teach youthe process of beginning an iOS app and it will end with the chapter, Persist Data . Download the sample project from the final lesson (the “Download File” linkat the bottom of the page).Extract the zip file, Start-Dev-iOS-Apps-10.zip , browse into its folder with Finder, and double-click FoodTracker.xcodeproj . That will open the project in Xcode. Run the app (Command-R) and confirm thatit works correctly. If everything is in order, proceed with this document.COCOAPODSThe first step is to install CocoaPods which will allow you to quickly and easily use open source packages in your iOSapps. You will use the CocoaPods repository to integrate the Cloudant Sync Datastore library, called CDTDatastore .LEARNING OBJECTIVESAt the end of the lesson, you’ll be able to: 1. Install CocoaPods on your Mac 2. Use CocoaPods to download and integrate CDTDatastore with FoodTrackerINSTALL COCOAPODS ON YOUR MACThe CocoaPods web site has an excellent page, Getting Started , which covers installing and upgrading. For your purposes, you will use themost simple approach to installation, the command-line gem program.To install CocoaPods 1. Open the Terminal application 1. Click the Spotlight icon (a magnifying glass) in the Mac OS task bar 2. Type “terminal” in the Spotlight prompt, and press return 2. In Terminal, type this command:gem install cocoapods Note , if you receive an error message and the CocoaPods gem does not install, try this instead: sudo gem install cocoapods 3. Confirm that CocoaPods is installed with this command: pod --version You should see the CocoaPods version displayed in Terminal: 0.39.0 INSTALL CLOUDANT SYNC USING COCOAPODSTo install CDTDatastore as a dependency, create a Podfile , a simple configuration files which tell CocoaPods which packages this projectneeds.To create a Podfile 1. Choose File > New > File… (or press Command-N) 2. On the left side of the dialog that appears, under “iOS”, select Other. 3. Select Empty, and click Next. 4. In the Save As field, type Podfile . 5. The save location (“Where”) defaults to your project directory.The Group option defaults to your app name, FoodTracker. In the Targets section, make sure both your app and the tests for your app are not selected. 6. Click Create. Xcode will create a file called Podfile which is open in the Xcode editor.Next, configure CDTDatastore in the Podfile.To configure the Podfile 1. In Podfile , add the following codeplatform :ios, '9.1' pod ""CDTDatastore"", '~> 1.0.0' 2. Choose File > Save (or press Command-S)With your Podfile in place, you can now use CocoaPods to install theCDTDatastore pod.To install CDTDatastore 1. Open Terminal 2. Change to your project directory, the directory containing your new Podfile. For example,# Your 'cd' change to the folder you use. cd ""FoodTracker - Persist Data"" 3. Type this command. Note, *this may take a few minutes to complete . pod install --verbose You will see colorful output from CocoaPods in the terminal.CHANGE FROM A PROJECT TO A WORKSPACEBecause you are now integrating FoodTracker with the third-party CDTDatastorelibrary, your project is now a group of projects combined into one useful whole. 
Xcode supports this, and CocoaPods has already prepared you for this transition by creating FoodTracker.xcworkspace for you—a workspace encompassing both FoodTracker and CDTDatastore.

To change to your project workspace 1. Choose File > Close Window (or press Command-W). 2. Choose File > Open (or press Command-O). 3. Select FoodTracker.xcworkspace and click Open.

You will see a similar Xcode view as before, but notice that you now have two projects. Note, when you build or run the app, you may see compiler warnings from CDTDatastore code and its dependencies. You can safely ignore these warnings.

Checkpoint: Run your app. The app should behave exactly as before. Now you know that everything is in its place and working correctly.

COMPILE WITH CLOUDANT SYNC

Your next step is to compile FoodTracker along with CDTDatastore. You will not change any major FoodTracker code yet; however, this will confirm that CDTDatastore and FoodTracker integrate and compile correctly.

LEARNING OBJECTIVES

At the end of the lesson, you’ll be able to create a bridging header to link Swift and Objective-C code.

CREATE THE CDTDATASTORE BRIDGING HEADER

CDTDatastore is written in Objective-C. FoodTracker is a Swift project. Currently, the best way to integrate these projects together is with a bridging header. The bridging header, CloudantSync-Bridging-Header.h, will tell Xcode to compile CDTDatastore into the final app.

To create a header file 1. Choose File > New > File (or press Command-N) 2. On the left side of the dialog that appears, under “iOS”, select Source. 3. Select Header File, and click Next. 4. In the Save As field, type CloudantSync-Bridging-Header . 5. Click the down-arrow expander button to the right of the “Save As” field. This will display the file system tree of the project. 6. Click the FoodTracker folder. 7. Confirm that the Group option defaults to your app name, FoodTracker. 8. In the Targets section, check the FoodTracker target. 9. Click Create. Xcode will create and open a file called CloudantSync-Bridging-Header.h . 10. Under the line which says #define CloudantSync_Bridging_Header_h , insert the following code (the umbrella header name here assumes the standard CDTDatastore pod layout): #import <CloudantSync.h> 11. Choose File > Save (or press Command-S)

The header file contents are done. But, despite its name, this file is not yet a bridging header as far as Xcode knows. The final step is to tell Xcode that this file will serve as the Objective-C bridging header.

To assign a project bridging header 1. Enter the Project Navigator view by clicking the upper-left folder icon (or press Command-1). 2. Select the FoodTracker project in the Navigator. 3. Under Project, select the FoodTracker project. (It has a blue icon). 4. Click “Build Settings”. 5. Click All to show all build settings. 6. In the search bar, type “bridging header.” You should see Swift Compiler – Code Generation and inside it, Objective-C Bridging Header . 7. Double-click the empty space in the “FoodTracker” column, in the row “Objective-C Bridging Header”. 8. A prompt window will pop up. Input the following: FoodTracker/CloudantSync-Bridging-Header.h 9. Press return.

Your bridging header is done! Xcode should look like this:

CHECK THE BUILD

Checkpoint: Run your app. This will confirm that the code compiles and runs. While you have not changed any user-facing app code, you have begun the first step to Cloudant Sync by compiling CDTDatastore into your project.

STORE DATA LOCALLY WITH CLOUDANT SYNC

With CDTDatastore compiled and connected to FoodTracker, the next step is to replace the NSCoder persistence system with CDTDatastore.
Currently, in MealTableViewController.swift , during initialization, the encoded array of meals is loaded from localstorage. When you add or change a meal, the entire meals array is encoded and stored on disk.You will replace that system with a document-based architecture—in other words,each meal will be one record (called a “document” or simply “doc”) in theCloudant Sync datastore.Keep in mind, this first step of using Cloudant Sync does not use the Internet at all . The first goal is simply to store app data locally, in CDTDatastore. Afterthat works correctly, you will add the ability to sync with Cloudant.OFFLINE FIRSTThis is the offline-first architecture , with Internet access being optional to use the app. All data operations areon the local device. If the device has an Internet connection, then the app willsync its data with Cloudant—covered in future posts in this series.LEARNING OBJECTIVESAt the end of the lesson, you’ll be able to: 1. Understand the Cloudant document model: 1. Key-value storage for simple data types 2. Attachment storage for binary data 3. The document ID and revision ID 2. Store meals in the Cloudant Sync datastore 3. Query for meals in chronological order, from the datastoreTHE CLOUDANT DOCUMENT MODELLet’s begin with a discussion of Cloudant basics. The document is the primary data model of the Cloudant database, not only CDTDatastore foriOS, but also for Android, the Cloudant hosted database, and even the opensource Apache CouchDB database.A document, often called a doc , is a set of key-value data. Do not think, “Microsoft Office document”; think“JSON object.” A document is a JSON object: keys (strings) can have values:Ints, Doubles, Bools, Strings, as well as nested Arrays and Dictionaries.Documents can also contain binary blobs, called attachments . You can add, change, or remove attachments in a very similar way as you wouldadd, change, or remove key-value data in a doc.All documents always have two pieces of metadata used to manage them. The document ID (sometimes called _id or simply id ) is a unique string identifying the doc. You use the ID to read, and write aspecific document. When you create a document, you may omit the _id value, in which case Cloudant will automatically generate a unique ID for thedocument.The revision ID (sometimes called _rev or revision ) is a string generated by the datastore which tracks when the doc changes. Therevision ID is mostly used internally by the datastore, especially to facilitatereplication. In practice, you need to remember the basics about revisions : * The revision ID changes every time you update a document. * When you update a document, you provide the current revision ID to the datastore, and the datastore will return to you the new revision ID of the new document. * When you create a document, you do not provide a revision ID, since there is no such “current” document.Finally, note that deleting a document is actually an update, with metadata setto indicate deletion, called a tombstone . Since a delete is an update just like any other, the deleted document willhave its own revision ID. The tombstones are necessary for replication:replicating a tombstone from one database to another will cause doc to bedeleted in both databases. As far as your app is concerned, it can consider thedocument deleted).DESIGN PLANWith this in mind, consider: how will the sample meals that are pre-loaded intothe app work? At first, you might think to create meal documents whenFoodTracker starts. 
That will work correctly the first time the user runs theapp; however, if the user changes or deletes the sample meals, those changes must persist . For example, if the user deletes the sample meals and then restarts the applater, those meals must remain deleted.To support this requirement, you will use document tombstones . This will be the basic design: * Each meal will be represented by a single document. User-created meals will have an automatically-generated document ID; but sample meals will have hard-coded document IDs: “meal1”, “meal2”, and “meal3”.// An example meal document: { ""_id"": ""meal1"", ""name"": ""Caprese Salad"", ""rating"": 4, ""created_at"": ""2016-01-03T02:15:49.727Z"" } * Sample meals have a hard-coded docId . Just before creating a sample meal, first try to fetch the meal by ID. * If CDTDatastore returns a meal doc, that means it has already been created. Do nothing . * If CDTDatastore returns a ""not_found"" error, that means the meal has never been created. Proceed with doc creation . * If CDTDatastore returns a different error, that means the meal has been created and then deleted. Do nothing . Now, you can put this understanding into practice by transitioning to CloudantSync for local app data storage.REMOVE NSCODINGBegin cleanly by removing the current NSCoding system from the model and thetable view controller.To remove NSCoding from the model 1. Open Meal.swift 2. Find the class declaration, which saysclassMeal: NSObject, NSCoding{ 3. Remove the word NSCoding and also the comma before it, making the new class declaration look like this: classMeal: NSObject{ 4. Delete the comment line, // MARK: NSCoding . 5. Delete the method below that, encodeWithCoder(_:) . 6. Delete the method below that, init?(_:) .Next, remove NSCoding from the table view controller.To remove NSCoding from the table view controller 1. Open MealTableViewController.swift 2. Find the method viewDidLoad() , and delete the comment beginning // Load any saved meals and also the if/else code below it:// Load any saved meals, otherwise load sample data.iflet savedMeals = loadMeals() { meals += savedMeals } else { // Load the sample data. loadSampleMeals() } 3. Delete the method loadSampleMeals() , which is immediately beneath the viewDidLoad() method. 4. Find the method tableView(_:commitEditingStyle:forRowAtIndexPath:) and delete the line of code saveMeals() . 5. Find the method unwindToMealList(_:) and delete its last two lines of code: a comment, and a call to saveMeals() . // Save the meals. saveMeals() 6. Delete the comment line, // MARK: NSCoding 7. Delete the method below that, saveMeals() . 8. Delete the method below that, loadMeals() .Checkpoint: Run your app. The app will obviously lose some functionality: loading stored meals, andcreating the first three sample meals; although you can still create, edit, andremove meals (but they will not persist if you quit the app). That is okay. Inthe next step, you will restore these functions using Cloudant Sync instead.INITIALIZE THE CLOUDANT SYNC DATASTORENow you will add loading and saving back to the app, using the Cloudant Syncdatastore. A meal will be a document, with its name and rating stored askey-value data, and its photo stored as an attachment. Additionally, you willstore a creation timestamp, so that you can later sort the meals in the orderthey were created.Begin with the Meal model, the file Meal.swift . You will add a new initialization method which can create a Meal object froma document. 
In other words, the init() method will set the meal name and rating from the document key-value data; andit will set the meal photo from the document attachment.Representing a Meal as a Cloudant document requires few changes besides theinitialization function. The only change to the the actual model is to addvariables for the underlying document ID, and the creation time. By rememberinga meal’s document ID, you will be able to change that doc when the user changesthe meal (e.g. by changing its rating, its name, or its photo). And by storingits creation time, you can later query the database for meals in the order thatthe user created them.To add Cloudant Sync datastore support 1. Open Meal.swift 2. In Meal.swift , in the section MARK: Properties , append these lines so that the variable declarations look like this:// MARK: Propertiesvar name: Stringvar photo: UIImage? var rating: Int// Data for Cloudant Syncvar docId: String? var createdAt: NSDate 3. In Meal.swift , edit the init?(_:photo:rating:) method to accept docId as a final argument, and to set the docId and createdAt properties . When you are finished, the method will look like this: init?(name: String, photo: UIImage?, rating: Int, docId: String?) { // Initialize stored properties.self.name = name self.photo = photo self.rating = rating self.docId = docId self.createdAt = NSDate() super.init() // Initialization should fail if there is no name or if the// rating is negative.if name.isEmpty || rating < 0 { returnnil } } Now add a convenience initializer. This initializer will use a givenCDTDatastore document to create a Meal object.To create a convenience initializer 1. Open Meal.swift 2. In Meal.swift, below the method init?(_:photo:rating:docId:) , add the following code:requiredconvenienceinit?(aDoc doc:CDTDocumentRevision) { iflet body = doc.body { let name = body[""name""] as! Stringlet rating = body[""rating""] as! Intvar photo: UIImage? = niliflet photoAttachment = doc.attachments[""photo.jpg""] { photo = UIImage( data: photoAttachment.dataFromAttachmentContent()) } self.init(name:name, photo:photo, rating:rating, docId:doc.docId) } else { print(""Error initializing meal from document: \(doc)"") returnnil } } That’s it for the model. The Meal class now tracks its underlying document IDand creation time; and it supports convenient initialization directly from ameal document.Since the Meal model initializer has a new docId: String? parameter, you will need to update the one bit of code which initializes Mealobjects, in the Meal view controller.To update the meal view controller 1. Open MealViewController.swift 2. In MealViewController.swift , find the function prepareForSegue(_:sender:) and change the last section of code to (dd , docId: docId ):// Set the meal to be passed to MealTableViewController after the// unwind segue.let docId = meal?.docId meal = Meal(name: name, photo: photo, rating: rating, docId: docId) Now the model has been updated to work from Cloudant Sync documents.Checkpoint: Run your app. The app should build successfully. This will confirm that all changes areworking together harmoniously. Of course, the app behavior is obviouslyincomplete, which you will correct in the next steps.All that remains is to use the datastore from the Meal table view controller.Begin by initializing the datastore and data.To initialize the datastore 1. Open MealTableViewController.swift 2. 
In MealTableViewController.swift , in the section MARK: Properties , append these lines so that the variable declarations look like this:// MARK: Propertiesvar meals = [Meal]() var datastoreManager: CDTDatastoreManager? var datastore: CDTDatastore? 3. In MealTableViewController.swift , append the following code at the end of the method viewDidLoad() : // Initialize the Cloudant Sync local datastore. initDatastore() Now write the initialization function. Begin by creating a code marker for thenew Cloudant Sync datastore methods.To create a code marker for your code 1. Open MealTableViewController.swift 2. In MealTableViewController.swift , find the last method in the class, unwindToMealList(_:) 3. Below that method, add the following:// MARK: Datastore This will be the section of the code where you implement all Cloudant Syncdatastore functionality.To implement datastore initialization , in MealTableViewController.swift , append the following code in the section MARK: Datastore :funcinitDatastore() { let fileManager = NSFileManager.defaultManager() let documentsDir = fileManager.URLsForDirectory(.DocumentDirectory, inDomains: .UserDomainMask).last! let storeURL = documentsDir.URLByAppendingPathComponent(""foodtracker-meals"") let path = storeURL.path do { datastoreManager = tryCDTDatastoreManager(directory: path) datastore = try datastoreManager!.datastoreNamed(""meals"") } catch { fatalError(""Failed to initialize datastore: \(error)"") }}SIDE NOTE: DELETING THE DATASTORE IN THE IOS SIMULATORSometimes during development, you may want to delete the datastore and startover. There are several ways to do this, for example, by deleting the app fromthe simulated device.However, here is a quick command you can paste into the terminal. It will removethe Cloudant Sync database. When you restart the app, the app will initialize anew datastore and behave as if this was its first time to run. For example, itwill re-create the sample meals again.To delete the datastore from the iOS Simulatorrm -i -rv $HOME/Library/Developer/CoreSimulator/Devices/*/data/Containers/Data/Application/*/Documents/foodtracker-mealsThis command will prompt you to remove the files. If you are confident that thecommand is working correct, you can omit the -i option.IMPLEMENT STORING AND QUERYING MEALSWith the datastore initialized, you need to write methods to store and retrievemeal documents. This is the cornerstone of your project. With a few methods tointeract with the datastore, you will enjoy all the benefits the Cloudant Syncdatastore brings: offline-first operation and cloud syncing.For FoodTracker, you will have two primary ways of persisting meals in thedatastore: creating meals and updating meals. Each of these will have its ownmethod, but the methods will share some common code to populate a meal documentwith the correct data. Begin by writing this method. Given a Meal object and aCloudant document, it will copy all of the meal data to the document, so thatthe latter can be created or updated as needed.To implement populating a meal document 1. Open MealTableViewController.swift 2. In MealTableViewController.swift , in the section MARK: Datastore , append a new method:funcpopulateRevision(meal: Meal, revision: CDTDocumentRevision?) { // Populate a document revision from a Meal.let rev: CDTDocumentRevision = revision ?? 
CDTDocumentRevision(docId: meal.docId) rev.body[""name""] = meal.name rev.body[""rating""] = meal.rating // Set created_at as an ISO 8601-formatted string.let dateFormatter = NSDateFormatter() dateFormatter.locale = NSLocale(localeIdentifier: ""en_US_POSIX"") dateFormatter.timeZone = NSTimeZone(abbreviation: ""GMT"") dateFormatter.dateFormat = ""yyyy-MM-dd'T'HH:mm:ss.SSS'Z'""let createdAtISO = dateFormatter.stringFromDate(meal.createdAt) rev.body[""created_at""] = createdAtISO iflet data = UIImagePNGRepresentation(meal.photo!) { let attachment = CDTUnsavedDataAttachment(data: data, name: ""photo.jpg"", type: ""image/jpg"") rev.attachments[attachment.name] = attachment } } Next, implement the method to create new meal documents. Note that sample mealswill have hard-coded document IDs, so that you can detect if they have alreadybeen created or not. User-created meals will have no particular doc ID.To implement meal document creation 1. In MealTableViewController.swift , in the section MARK: Datastore , append a new method:// Create a meal. Return true if the meal was created, or false if// creation was unnecessary.funccreateMeal(meal: Meal) -> Bool { // User-created meals will have docId == nil. Sample meals have a// string docId. For sample meals, look up the existing doc, with// three possible outcomes:// 1. No exception; the doc is already present. Do nothing.// 2. The doc was created, then deleted. Do nothing.// 3. The doc has never been created. Create it.iflet docId = meal.docId { do { try datastore!.getDocumentWithId(docId) print(""Skip \(docId) creation: already exists"") returnfalse } catchlet error asNSError { if (error.userInfo[""NSLocalizedFailureReason""] as? String != ""not_found"") { print(""Skip \(docId) creation: already deleted by user"") returnfalse } print(""Create sample meal: \(docId)"") } } let rev = CDTDocumentRevision(docId: meal.docId) populateRevision(meal, revision: rev) do { let result = try datastore!.createDocumentFromRevision(rev) print(""Created \(result.docId)\(result.revId)"") } catch { print(""Error creating meal: \(error)"") } returntrue } Now you are ready to write the update method. Note that “deleting” a Cloudantdocument is in fact a type of update . The update method will accept a Bool parameter indicating whether to deletethe document or not. However, to keep the rest of the code simple, you willwrite one-line convenience methods deleteMeal(_:) and updateMeal(_:) to set the deletion flag automatically.To implement deleting and updating meal documents 1. In MealTableViewController.swift , in the section MARK: Datastore , append the two convenience methods and then the full implementation.funcdeleteMeal(meal: Meal) { updateMeal(meal, isDelete: true) } funcupdateMeal(meal: Meal) { updateMeal(meal, isDelete: false) } funcupdateMeal(meal: Meal, isDelete: Bool) { guardlet docId = meal.docId else { print(""Cannot update a meal with no document ID"") return } let label = isDelete ? 
""Delete"" : ""Update""print(""\(label)\(docId): begin"") // First, fetch the current document revision from the DB.var rev: CDTDocumentRevisiondo { rev = try datastore!.getDocumentWithId(docId) populateRevision(meal, revision: rev) } catch { print(""Error loading meal \(docId): \(error)"") return } do { var result: CDTDocumentRevisionif (isDelete) { result = try datastore!.deleteDocumentFromRevision(rev) } else { result = try datastore!.updateDocumentFromRevision(rev) } print(""\(label)\(docId) ok: \(result.revId)"") } catch { print(""Error updating \(docId): \(error)"") return } } Your app can now create, update, and delete meal docs. To complete this feature,these methods must be integrated with UI. When the user saves or deletes a meal,the controller must run these methods.To create and update meals 1. In MealTableViewController.swift , in the method unwindToMealList(_:) , modify the method body so that it calls updateMeal() or createMeal() as appropriate. The code will look as follows:iflet selectedIndexPath = tableView.indexPathForSelectedRow { // Update an existing meal. meals[selectedIndexPath.row] = meal tableView.reloadRowsAtIndexPaths([selectedIndexPath], withRowAnimation: .None) updateMeal(meal) } else { // Add a new meal.let newIndexPath = NSIndexPath(forRow: meals.count, inSection: 0) meals.append(meal) tableView.insertRowsAtIndexPaths([newIndexPath], withRowAnimation: .Bottom) createMeal(meal) } 2. In the method tableView(_:commitEditingStyle:forRowAtIndexPath) , insert a call to deleteMeal(_:) for the .Delete editing event. The code will look as follows. if editingStyle == .Delete { // Delete the row from the data sourcelet meal = meals[indexPath.row] deleteMeal(meal) meals.removeAtIndex(indexPath.row) tableView.deleteRowsAtIndexPaths([indexPath], withRowAnimation: .Fade) The final thing to write is the code to query for meals in the datastore. Thiscode has two parts: initializing an index during app startup (to query bytimestamp), and of course the code to query that index.To support querying meals by timestamp 1. In MealTableViewController.swift , in the method initDatastore() , append this code:datastore?.ensureIndexed([""created_at""], withName: ""timestamps"") // Everything is ready. Load all meals from the datastore. loadMealsFromDatastore() 2. In MealTableViewController.swift , in the section MARK: Datastore , append this method: funcloadMealsFromDatastore() { let query = [""created_at"": [""$gt"":""""]] let result = datastore?.find(query, skip: 0, limit: 0, fields:nil, sort: [[""created_at"":""asc""]]) guard result != nilelse { print(""Failed to query for meals"") return } meals.removeAll() result!.enumerateObjectsUsingBlock({ (doc, idx, stop) -> Voidiniflet meal = Meal(aDoc: doc) { self.meals.append(meal) } }) } That’s it! The most intricate part of your code is finished.CREATE SAMPLE MEALS IN THE DATASTORENow is time to create sample meal documents during app startup. This method willrun every time the app initializes. For each sample meal, it will call createMeal(_:) which will either create the documents or no-op, as needed.To create sample meals during app startup 1. In MealTableViewController.swift , in the section MARK: Datastore , add a new method:funcstoreSampleMeals() { let photo1 = UIImage(named: ""meal1"")! let photo2 = UIImage(named: ""meal2"")! let photo3 = UIImage(named: ""meal3"")! let meal1 = Meal(name: ""Caprese Salad"", photo: photo1, rating: 4, docId: ""sample-1"")! 
let meal2 = Meal(name: ""Chicken and Potatoes"", photo: photo2, rating: 5, docId: ""sample-2"")!
let meal3 = Meal(name: ""Pasta with Meatballs"", photo: photo3, rating: 3, docId: ""sample-3"")!

// Hard-code the createdAt property to get consistent revision IDs. That way, devices that share
// a common cloud database will not generate conflicts as they sync their own sample meals.
let comps = NSDateComponents()
comps.day = 1
comps.month = 1
comps.year = 2016
comps.timeZone = NSTimeZone(abbreviation: ""GMT"")
let newYear = NSCalendar.currentCalendar().dateFromComponents(comps)!

meal1.createdAt = newYear
meal2.createdAt = newYear
meal3.createdAt = newYear

createMeal(meal1)
createMeal(meal2)
createMeal(meal3)
}

2. In MealTableViewController.swift , in the method initDatastore() , insert a call to storeSampleMeals() before the code initializing the index. The final lines of the method will look as follows:

storeSampleMeals()
datastore?.ensureIndexed([""created_at""], withName: ""timestamps"")

// Everything is ready. Load all meals from the datastore.
loadMealsFromDatastore()
}

Checkpoint: Run your app. The app should behave exactly as it did at the beginning of this project.

CONCLUSION

Congratulations! While the app remains unchanged superficially, you have made a very powerful upgrade to FoodTracker’s most important aspect: its data. You have transformed the data layer from a minimal, unexceptional side note to become a flexible, powerful database. This database can be queried, searched, scaled, and replicated between devices and through the cloud. The next update of this series will cover replicating this data to the cloud using IBM Cloudant. Indeed, implementing cloud syncing is much simpler than the work from this lesson. You have completed laying the foundation!

DOWNLOAD THIS PROJECT

To see the completed sample project for this lesson, download the file and view it in Xcode. Download File

* Tagged: cloudant / iOS / Mobile / swift","Apple's sample app, Food Tracker, taught you iOS. Now, take it further and sync data between devices, through the cloud, with an offline-first design.",Offline-First iOS Apps with Swift & Cloudant Sync; Part 1: The Datastore,Live,10 33,"Warehousing data from Cloudant to dashDB greatly enhances your options to analyze that data. Now, we have extended this capability to include GeoJSON documents.
An ever increasing number of mobile and internet-of-things (IoT) applications capture and store geospatial data in NoSQL databases such as those provided by Cloudant. GeoJSON is the de-facto standard for such data. This new capability enables your data analysis to reflect geospatial aspects. For example, you can gain even more data insight by combining existing data with geospatial data from other sources, such as weather services, which become simple to integrate under IBM’s new partnership with The Weather Company. Or you can use the power of the statistical R language to do the geospatial analysis. In this post my co-author, Holger Kache, and I briefly describe how GeoJSON documents are warehoused. Then we will illustrate the new capabilities with some examples.

As a Cloudant user interested in capturing geospatial information, GeoJSON is probably familiar to you. Cloudant already offers ways to store your spatial data in GeoJSON and the ability to do basic analysis with spatial functions. Anyway, let’s briefly touch on how GeoJSON documents are structured. In essence, GeoJSON documents usually come in one of three form factors:

Atomic geometries, with a geometry type (like Point, LineString, Polygon, etc.) and coordinates, like this LineString:

{ ""type"": ""LineString"", ""coordinates"": [[-71.06, 42.36], [-71.05, 42.37]] }

Feature types, which are atomic geometries along with some text properties. For example, a name:

{ ""type"": ""Feature"", ""geometry"": { ""type"": ""Point"", ""coordinates"": [-71.06, 42.36] }, ""properties"": { ""name"": ""Boston"" } }

FeatureCollection types, which combine lists of Feature types under a common roof:

{ ""type"": ""FeatureCollection"", ""features"": [ { ""type"": ""Feature"", ""geometry"": { ... }, ""properties"": { ... } } ] }

Because the three form factors have different structures, they would result in different table structures under the standard (non-GeoJSON) warehousing. For GeoJSON, in contrast, we don’t want that. We want one consistent table structure for all three form factors. Cloudant achieves this consistency by internally converting each document to a FeatureCollection type first. Let’s look at this in more detail.

Suppose that we created a Cloudant database called geojson_demo that contains the three example documents from above. Let’s take a look at how these documents are warehoused into dashDB tables. The process of scheduling a warehouse is unchanged from what was described in this earlier post. So we will jump directly to the newly created tables in the dashDB warehouse, as seen in the dashDB Tables page: As you can see, the warehousing process created three tables: a base table called GEOJSON_DEMO, a feature table called GEOJSON_DEMO_FEATURES, and an overflow table GEOJSON_DEMO_OVERFLOW that contains potential issues encountered during the warehousing. Here is the base table GEOJSON_DEMO: Each document results in one row. You can see that they were all converted to the FeatureCollection type. The most interesting table is the GEOJSON_DEMO_FEATURES table because it contains the geometries in the GEOMETRY column along with the property name in the PROPERTIES_NAME column: In this table each geometry occupies one row. Again, all geometries appear as the Feature type in this table regardless of their initial structure. This makes it easy for you to access the different geometries in dashDB: you can always expect them to show up in the GEOMETRY column of the table, along with the properties in columns named PROPERTIES_<property name>. The geometries are stored in one of dashDB’s geospatial data types such as ST_Point, ST_LineString, ST_Polygon, or the like. You might have noticed another difference from standard warehousing: all table and column names are now uppercase instead of mixed case. The reason for this is that many database access tools like ESRI’s ArcMap expect uppercase database objects.
Again, we made it as easy as possible for you to access the warehousing results. In this first version of the GeoJSON support, there are some restrictions that you should be aware of:

First, we expect a homogeneous database with respect to geometry types. Consequently, if you mix different geometry types like Points, LineStrings, or Polygons in one database, we will warehouse only the documents that contain the most frequently occurring geometry type and reject all others. You can find information about the rejected documents in the EXCEPTION column of the overflow table.

We only support the default Coordinate Reference System (CRS), WGS84. If you specify a different CRS, you will get a warning in the WARNING column of the overflow table like this one:

We do not support the geometry type GeometryCollection because there is no suitable geometry type available in dashDB. So if your data has the GeometryCollection type, restructure it to the FeatureCollection type instead.

The GeoJSON bounding box member bbox is ignored because in dashDB we internally calculate the bounding box of each geometry at loading time.

Now that your GeoJSON data is in the warehouse, it is time to kick off some spatial analysis. Suppose you have warehoused the Boston criminal incidence reports (which you can find as a GeoJSON database under http://opendata.cloudant.com/crimes). Let’s take a look at it by using ESRI’s ArcMap tool: Or, by using ArcMap’s kernel density tool, you can directly highlight the critical areas. Because the tool will import all the data first, this process is rather time consuming. But there is a way to speed things up. Look at this example where we join the crimes data with the neighborhood districts to highlight critical districts: For this analysis, we imported the neighborhood districts, available in shape format, into dashDB and joined them with the point geometries of the warehoused crimes database. This join is fast because it is done at the database level, without exporting the data to the tool first. The underlying SQL looks like this:

1 SELECT CN.NAME, CN.NUM_CRIMES / BN.""Acres"" AS CDENSITY, BN.GEO_DATA
2 FROM
3 (SELECT N.""Name"" AS NAME, COUNT(C.""_ID"") AS NUM_CRIMES
4 FROM
5 CRIMES_FEATURES AS C,
6 BOSTON_NEIGHBORHOODS AS N
7 WHERE
8 DB2GSE.ST_CONTAINS(N.GEO_DATA, C.GEOMETRY) = 1
9 GROUP BY N.""Name"") AS CN,
10 BOSTON_NEIGHBORHOODS AS BN
11 WHERE
12 CN.NAME = BN.""Name"";

As you can see, the neighborhood polygons N.GEO_DATA are joined with the points of the crimes table C.GEOMETRY (line 8). The crime density CDENSITY is calculated as the count of crimes per district divided by the area in acres (line 1).

Once your data is in dashDB you also have the full power of the statistical R language at hand. As one example, you might be interested in how many crimes happen over the course of a day. Here is the answer: To produce this graph, within the dashDB console we chose the menu item Analytics > R Scripts, pasted in the following script, and then clicked Submit. Again, this will be fast because we are using the in-database analytics functions of R.

# Init
library(ggplot2)
# Connect to the database and read in the data frame
idaInit(idaConnect(""BLUDB"","""",""""))

And finally
There are lots of other ways that you can exploit geospatial data by using the Cloudant to dashDB integration: use other data sources, take advantage of the geospatial capabilities of R, use it in a mobile scenario… you name it. So go ahead and try out what works for you – and let us know.
Also, if you encounter any shortcomings, leave us a comment or email.

References and some more links
* The GeoJSON Format Specification
* Cloudant blog: Introducing Data Warehousing and Analytics with Cloudant and dashDB
* More on IBM dashDB
* More on geospatial processing with dashDB
* Holger’s blog on Warehouse style analytics for the cloud
* YouTube Video on Analyzing Geospatial Data with IBM dashDB and Esri ArcGIS for Desktop","Replicating data to a relational dashDB database greatly enhances your options to analyze that data. In addition to the ability to query the warehouse with SQL, you can use the power of the statistical R language to do the analysis. Now, we have extended Cloudant’s warehousing capability to include GeoJSON documents. An ever increasing number of mobile and internet-of-things (IOT) applications capture and store geospatial data in NoSQL databases such as those provided by Cloudant. GeoJSON is the de-facto standard for such data. This new capability enables your data analysis to reflect geospatial aspects.",Warehousing GeoJSON documents,Live,11 36,"Recipes@IoTF

TIMESERIES DATA ANALYSIS OF IOT EVENTS BY USING JUPYTER NOTEBOOK

THIS RECIPE SHOWCASES HOW ONE CAN ANALYZE THE HISTORICAL TIME SERIES DATA, CAPTURED ON THE IBM WATSON IOT PLATFORM, IN A JUPYTER NOTEBOOK USING SPARK SQL AND PANDAS DATAFRAMES. ALSO, USE THE PRE-INSTALLED MATPLOTLIB LIBRARY TO VISUALIZE RESULTS.

REQUIREMENTS * IBM Bluemix account * Git (Optional) * Maven (Optional)

SKILL LEVEL INTERMEDIATE Basic knowledge of 1. IBM Watson IoT Platform 2. Apache Spark 3. Cloudant NoSQL 4. Pandas for Data Manipulation

RECIPES TO ENHANCE ANALYTICS IN IBM WATSON IOT PLATFORM Before you proceed, evaluate the following analytical recipes that suit your needs.

INTRODUCTION In the previous recipe “ Engage Machine Learning for detecting anomalous behaviors of things ”, we saw how one can integrate IBM Watson IoT, Apache Spark service, Predictive Analysis service and Real-Time Insights to take timely action before an (unacceptable) event occurs. And in this recipe, we will make use of the data (historical data) produced by the previous recipe to discover the hidden patterns and the temperature trend over days, months and years using Apache Spark SQL, Pandas DataFrames and Jupyter Notebook.

What is Spark SQL and DataFrames? Apache Spark SQL is a Spark module for structured data processing.
Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations. There are several ways to interact with Spark SQL including SQL, the DataFrames API and the Datasets API. And in this recipe we will be using DataFrams to analyze and visualize the temperature data. A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs . To make things simple, this recipe does not alter the previous recipe setup, rather it just adds a Node-RED application and Cloudant NoSQL DB service on top of part 1 of the recipe as shown below, In case if you want to look at the overall architecture that shows the components of part1 and part2, take a look at this link . As shown, the Node-RED application will subscribe to the results (which contains the actual temperature, forecasted temperature, zscore and wzscore values) from the Watson IoT Platform and store them in a Cloudant NoSQL DB. This Cloundat NoSQL DB will act as a historical data storage. Once the Cloudant NoSQL DB is filled with enough data, this recipe will use the Jupyter Notebook to load the data into the Spark engine and use Spark SQL, Panda DataFrames, other graphical libraries to analyze the data and show the results in charts or graphs. Also, One can use the sample application present in the github to generate the historical data without running the previous recipe code. The steps are detailed in the following section. CREATE A NODE-RED APPLICATION In this step, we will create a Node-RED application which will store the results into Cloudant DB. Create Node-RED application 1. Open your favorite browser and go to Bluemix . If you are an existing Bluemix user, log in as usual. If you are new to Bluemix you can sign up for a free 30 day trial. 2. Once you signed up to Bluemix, click this link to create the Node-RED starter application in Bluemix. 3. Type a name for your application and click the Create button. 4. Wait for the Bluemix to create the application. Note that the Cloudant service is created along with the Node-RED application, so no need to create the Cloudant service separately. Create Node-RED flow 1. Once the application is created, Click on the application URL to open the Node-RED landing page. ( Note : Your application must be running for this to work, If your application has stopped for any reason, select the Restart button and wait for it to successfully restart). 2. Select “ Go to your Node-RED flow editor ” button to enter into the Node-RED flow editor. 3. Navigate to the menu at the top right of the screen and select Import from Clipboard. Copy the JSON string from the text area below and paste it into the dialog box in Node-RED and select OK. 
If there any issues, copy the contents from github .[{""id"":""f2818749.4a8ad8"",""type"":""ibmiot"",""z"":""519983d4.82448c"",""name"":""coi0nz""},{""id"":""f396adc3.b4e0d8"",""type"":""ibmiot in"",""z"":""519983d4.82448c"",""authentication"":""apiKey"",""apiKey"":""f2818749.4a8ad8"",""inputType"":""evt"",""deviceId"":"""",""applicationId"":"""",""deviceType"":""+"",""eventType"":""result"",""commandType"":"""",""format"":""json"",""name"":""IBM IoT"",""service"":""registered"",""allDevices"":true,""allApplications"":"""",""allDeviceTypes"":true,""allEvents"":false,""allCommands"":"""",""allFormats"":"""",""x"":206.1999969482422,""y"":144.1999969482422,""wires"":[[""a98be6b0.04c158"",""f42ea830.63aee8""]]},{""id"":""f42ea830.63aee8"",""type"":""cloudant out"",""z"":""519983d4.82448c"",""service"":""gateway-sample-cloudantNoSQLDB"",""cloudant"":"""",""name"":""Cloudant Store"",""database"":""recipedb"",""payonly"":true,""operation"":""insert"",""x"":460.1999969482422,""y"":231.1999969482422,""wires"":[]},{""id"":""a98be6b0.04c158"",""type"":""debug"",""z"":""519983d4.82448c"",""name"":""Debug"",""active"":true,""console"":""false"",""complete"":""payload"",""x"":382.49998474121094,""y"":144,""wires"":[]}] 4. This imported flow has 3 nodes * IBM IoT In node – This node subscribes to all ‘result’ events published by any device in the same organization * Debug node – This node displays the above events in the debug tab of the Node-RED flow * Cloudant Store – This node persists the event in the ‘ recipedb ‘ store in Cloudant DBNote that this imported flow is not complete. Carry out the following steps to complete the flow. 5. Double click on the IBM IoT node and enter the IoT credentials, such as API Key and API Token as shown below and leave the other fields as is, 6. If the previous recipe is already running, then you should observe the result events in the debug window of the Node-RED and also in the Cloudant NoSQL DB. (In case if you want to quickly load the Cloudant NoSQL DB without running the previous recipe code, carry out the steps mentioned in the last sub-section of this section) View the events in Cloudant NoSQL DB 1. Go to Bluemix Dashboard, 2. Click the Node-RED application that you created in this step. 3. Observe that a Cloudant NoSQL DB service present as part of the application. Click on the service and then the Launch button. 4. Observe that a Databased called “ recipedb ” is created where all the result events are stored. 5. Click recipedb to enter inside the database and click on any document to view the events. Retrieve the credentials of Cloudant DB to load the events in Spark 1. Go back to Bluemix Dashboard, 2. Click the Node-RED application that you created in this step. 3. Click on the Show Credentials tab as shown below and note down the username & password. This will be required to load the events into the Spark engine. Load Cloudant DB with sample data – Required only if you want to bypass the previous recipe and quickly generate the historical data, 1. Download and install Maven and Git if not installed already. 2. Clone the iot-predictive-analytics repository as follows:git clone https://github.com/ibm-messaging/iot-predictive-analytics-samples.git 3. Navigate to the DeviceDataGenerator project and build the project using maven,mvn clean package  (This will download all required dependencies and starts the building process. Once built, the sample can be located in the target directory, with the filename IoTDataGenerator-1.0.0-SNAPSHOT.jar) 4. 
Run the Historical generator sample using the following command: mvn exec:java -Dexec.mainClass=""com.ibm.iot.iotdatagenerator.HistoricalDataGenerator"" -Dexec.args="" ""  5. Observe that the application connects to the Cloudant NoSQL DB service and stores the simulated resultant events as documents. Observe that the timestamp of the first document is January 18th and the interval between 2 records are 2 minutes. One can modify the code HistoricalDataGenerator.java to control the timestamp. In this step, we have successfully created a Node-RED application to store the results into the Cloudant NoSQL DB. CREATE A SPARK SQL DATAFRAME In this step, we will create the Notebook application and load the Cloudant data into Apache Spark service. What is Jupyter Notebook? The Jupyter Notebook is a web application that allows one to create and share documents that contain executable code, mathematical formulae, graphics/visualization (matplotlib) and explanatory text. Its primary use includes: 1. Data cleaning and transformation, 2. Numerical simulation, 3. Statistical modeling, 4. Machine learning and much more. Create a Notebook 1. While the first notebook is running, go back to the Bluemix Catalog and open the same Apache Spark service that you created as part of recipe 1 . 2. Click NOTEBOOKS button to show existing Notebooks. Click on NEW NOTEBOOK button. 3. Enter a Name, under Language select Python and click CREATE NOTEBOOK button to create a new notebook. Load data into Spark and perform basic operations 1. Go to the notebook, In the first cell (next to In [ ] ), enter the following command that creates the SQLContext and click Run. The SQLContext is the main entry point into all functionality in Spark SQL and is necessary to create the DataFrames.sqlContext=SQLContext(sc) 2. Enter the following statements into the second cell, and then click Run . Replace hostname, username, and password with the hostname, username, and password for your Cloudant account. This command reads the recipedb database from the Cloudant account and assigns it to the cloudantdata variable.cloudantdata=sqlContext.read.format(""com.cloudant.spark""). option(""cloudant.host"",""hostname""). option(""cloudant.username"", ""username""). option(""cloudant.password"", ""password""). load(""recipedb"") 3. Enter the following statement into the third cell, and then click Run . This command will return the schema as shown below,cloudantdata.printSchema() out[3]:root |– _id: string (nullable = true) |– _rev: string (nullable = true) |– forecast: double (nullable = true) |– name: string (nullable = true) |– temperature: double (nullable = true) |– timestamp: string (nullable = true) |– wzscore: double(nullable = true) |– zscore: double (nullable = true) 4. Enter the following command in the next cell to look at one record (document) and click Run ,cloudantdata.take(1) [Row(_id=u’0001683791e04032a4ca0955b70b12f8′, _rev=u’1-e1fac6e387132edcb0450c7b2f26d35b’, forecast=17.530496368661147, name=u’datacenter’, temperature=17.53, timestamp=u’2016-Mar-14 14:28:00′, wzscore=-0.3973792127656496, zscore=-0.062204879122082946)] 5. Enter the following command in the next cell to get the number of rows in the Cloundant NoSQL DB and click Run ,cloudantdata.count() 33980 6. Enter the following command in the next cell to get only the temperature values and click Run, (Note that it will return only the top 20 rows ), cloudantdata.select(""temperature"").show() +—————+ |temperature| +—————+ | 18.66| | 18.5| | 18.53| | 18.56| | 17.595| …………. 
| 17.5| +—————+ only showing top 20 rows In this step we have successfully loaded the historical (Cloudant NoSQL DB) data into the Spark Service and explored the schema. CREATE A PANDAS DATAFRAME In this step we will convert the Spark SQL DataFrame into Pandas timeseries Dataframe and perform basic operations. The Python Data Analysis Library (a.k.a. pandas) provides high-performance, easy-to-use data structures and data analysis tools that are designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Create a Pandas DataFrame 1. Enter the following commands in the next cell to create a Pandas DataFrame from the Spark SQL DataFrame and click Run. This line prints the schema of the newly created Pandas DataFrame which will be same as the Spark SQL DataFrame,import pprint import pandas as pd pandaDF = cloudantdata.toPandas() #Fill NA/NaN values to 0 pandaDF.fillna(0, inplace=True) pandaDF.columns Index([u’_id’, u’_rev’, u’forecast’, u’name’, u’temperature’, u’timestamp’, u’wzscore’, u’zscore’], dtype=’object’) 2. Using len on a DataFrame will give the number of rows as shown below,len(pandaDF) 33980 3. Columns can be accessed in two ways in Pandas. The first is using the DataFrame like a dictionary with string keys,pandaDF[""temperature""] 0 17.69 1 17.38 2 16.56 3 17.69 4 17.50 ……….. 4. You can get multiple columns out at the same time by passing in a list of strings as shown below,pandaDF[[""timestamp"",""temperature""]] timestamp temperature 0 2016-Mar-14 14:28:00 17.530 5. The second way to access columns is using the dot syntax. This only works if your column name could also be a Python variable name (i.e., no spaces), and if it doesn’t collide with another DataFrame property or function name (e.g., count, sum).pandaDF.temperature Create datetime as the index By default Pandas DataFrame uses the sequence number as index, since we analyze the timeseries data its better If we use datetime instead of integers for our index, we will get some extra benefits from pandas when plotting later on. This section will focus on doing the same, 1. Enter the following code in the next cell to make the timestamp as the index.#import the datatime library from datetime import datetime # convert the time from string to panda's datetime pandaDF.timestamp = pandaDF.timestamp.apply(lambda d: datetime.strptime(d, ""%Y-%b-%d %H:%M:%S"")) pandaDF.index = pandaDF.timestamp # Drop the timestamp column as the index is replaced with timestamp now pandaDF = pandaDF.drop([""timestamp""], axis=1) pandaDF.head() # Also, sort the index with the timestamp pandaDF.sort_index(inplace=True) 2. Enter the following command in the next cell to retrieve a row corresponding to a particular time and click Run, one can retrieve the temperature reading based on a relative datetime by first finding a closest time and then querying for it as shown below,''' One can query the temperature based on the datetime, incase if you are not sure about the exact time, then use searchsorted() method to get to the nearest date ''' date = pandaDF.index.searchsorted(datetime(2016, 2, 18, 17, 44, 23)) pandaDF.ix[date] _id 4d5306615f5a416e97921b3b70b75ab7 _rev 1-29ae6dc3d931fdeea1d9d20e5d50c5b8 forecast 17.61406 name datacenter temperature 17.69 wzscore 1.695702 zscore 0.3832717 Name: 2016-02-18 17:46:00, dtype: object As shown above, the row corresponding to the closest time is retrieved . 
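A datetime index also makes windowed queries and time-based aggregation one-liners. The short sketch below is not part of the original recipe; it assumes the pandaDF built above (sorted DatetimeIndex, a temperature column) and uses label-based .loc slicing plus resample, the modern equivalents of the .ix lookups shown in this notebook.
# Minimal sketch (assumes pandaDF from the steps above, with its sorted DatetimeIndex)
from datetime import datetime

one_day = pandaDF.loc[datetime(2016, 3, 14):datetime(2016, 3, 15)]  # readings for one day
hourly_mean = one_day['temperature'].resample('H').mean()           # hourly average temperature
print(hourly_mean.head())
Because the index is sorted, slices like this are cheap, and the same pattern carries over to the maximum and average calculations later in the recipe.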
In this step we have successfully created a Panda dataframe and performed few basic operations. In the next section we will see how one can visualize the temperature data using the matplotlib visualization library. VISUALIZE TEMPERATURE READINGS When working with interactive notebooks, one can decide how to present results and information. So far, we have used normal print functions which are informative. In this section, we will show how one can visualize the temperature data using the Pandas DataFrames and matplotlib library. 1. Enter the following command in the next cell to generate the histogram for the temperature and click Run, This will tell how well the temperature readings are distributed,#tell Jupyter to render charts inline: %matplotlib inline import matplotlib.pyplot as plt pandaDF.temperature.hist()  2. Enter the following commands in the next cell to plot the overall temperature and click Run , Observe that the graph is drawn with the timestamp in x axis and temperature values in y axis. Also, observe 2 Red lines showing the upper and lower thresholds,# Draw overall temperature %matplotlib inline import matplotlib.pyplot as plt import numpy as np plotDF = pandaDF[['temperature']] import matplotlib.dates as dates fig, ax = plt.subplots() plotDF.plot(figsize=[20,10], ax=ax, grid=True) ax.set_xlabel(""Timestamp"",fontsize=20) ax.set_ylabel(""Temperature"",fontsize=20) ax.set_title(""Overall Temperature"", fontsize=20) ax.set_ylim([12,22]) # Draw lines to showcase the upper and lower threshold ax.axhline(y=19,c=""red"",linewidth=2,zorder=0) ax.axhline(y=15,c=""red"",linewidth=2,zorder=0) ax.xaxis.set_minor_locator(dates.AutoDateLocator(tz=None, minticks=5, maxticks=None, interval_multiples=False)) ax.xaxis.set_minor_formatter(dates.DateFormatter('%dn%an%H:%M:%S')) ax.xaxis.grid(True, which=""minor"") ax.xaxis.set_major_locator(dates.MonthLocator()) ax.xaxis.set_major_formatter(dates.DateFormatter('nnn%bn%Y')) plt.tight_layout() plt.show() 3. If you want to plot only the last 400 values, then enter the following commands in the next cell and click Run . This will help one to understand the recent state of the system.# Draw Last 400 temperature values plotDF = pandaDF[['temperature']] fig, ax = plt.subplots() plotDF.tail(400).plot(figsize=[20,10], ax=ax, grid=True) ax.set_xlabel(""Timestamp"",fontsize=20) ax.set_ylabel(""Temperature"",fontsize=20) ax.set_title(""Recent 400 Temperature Values"", fontsize=20) ax.set_ylim([12,22]) ax.axhline(y=19,c=""red"",linewidth=2,zorder=0) ax.axhline(y=15,c=""red"",linewidth=2,zorder=0) ax.xaxis.set_minor_locator(dates.AutoDateLocator(tz=None, minticks=5, maxticks=None, interval_multiples=False)) ax.xaxis.set_minor_formatter(dates.DateFormatter('%dn%an%H:%M:%S')) ax.xaxis.grid(True, which=""minor"") ax.yaxis.grid() ax.xaxis.set_major_locator(dates.MonthLocator()) ax.xaxis.set_major_formatter(dates.DateFormatter('nnn%bn%Y')) plt.tight_layout() plt.show() 4. Similarly you can plot temperature values along with zscore & wzscore by entering the following commands into the next cell and click Run, In the following example, we plot the graph between 2 days. 
# Draw temperature chart with normal zscore & wzscore start = datetime(2016, 4, 19) end = datetime(2016, 4, 20) plotDF = pandaDF.ix[start:end] plotDF = plotDF[['temperature','zscore','wzscore']] if (len(plotDF) > 0): fig, ax = plt.subplots() plotDF.plot(figsize=[20,10], ax=ax, grid=True) # format the axis ax.set_xlabel(""Timestamp"",fontsize=20) ax.set_ylabel(""Temperature and zscore"",fontsize=20) ax.set_title(""Temperatures between "" + str(start) + "" and "" + str(end) + "" with zscore"", fontsize=20) ax.xaxis.set_minor_locator(dates.AutoDateLocator(tz=None, minticks=1, maxticks=None, interval_multiples=False)) ax.xaxis.set_minor_formatter(dates.DateFormatter('%dn%an%H:%M:%S')) ax.xaxis.grid(True, which=""minor"") ax.yaxis.grid() ax.xaxis.set_major_locator(dates.MonthLocator()) ax.xaxis.set_major_formatter(dates.DateFormatter('nnn%bn%Y')) plt.tight_layout() plt.show() else: print ""There are no rows matching the given condition, Try changing the dates"" 5. Enter the following command to overlay the zscore & wzscore along with the temperature, this will help one to understand the deviations better,# Draw temperature chart with scaled zscore & wzscore # define a method that scales zscore with the temperature def scaleZscore(row): return row['zscore'] + row['temperature'] # define a method that scales wzscore with the temperature def scaleWZscore(row): return row['wzscore'] + row['temperature'] # apply the functions pandaDF['scaledzscore'] = pandaDF.apply(scaleZscore, axis=1) pandaDF['scaledwzscore'] = pandaDF.apply(scaleWZscore, axis=1) start = datetime(2016, 2, 19) end = datetime(2016, 2, 20) plotDF = pandaDF.ix[start:end] if (len(plotDF) > 0): # create a dataframe with a required fields that we want to plot plotDF = plotDF[['temperature','scaledzscore','scaledwzscore']] fig, ax = plt.subplots() plotDF.plot(figsize=[23,12], ax=ax) ax.set_xlabel(""Timestamp"",fontsize=20) ax.set_ylabel(""Temperature and zscore"",fontsize=20) ax.set_title(""Temperatures between "" + str(start) + "" and "" + str(end) + "" with scaled zscore"", fontsize=20) ax.xaxis.set_minor_locator(dates.AutoDateLocator(tz=None, minticks=1, maxticks=None, interval_multiples=True)) ax.axhline(y=19,c=""purple"",linewidth=2,zorder=0) ax.axhline(y=15,c=""purple"",linewidth=2,zorder=0) ax.set_ylim([13,21]) ax.xaxis.set_minor_formatter(dates.DateFormatter('%dn%an%H:%M:%S')) ax.xaxis.grid(True, which=""minor"") ax.yaxis.grid() ax.xaxis.set_major_locator(dates.MonthLocator()) ax.xaxis.set_major_formatter(dates.DateFormatter('nnn%bn%Y')) plt.tight_layout() plt.show() else: print ""There are no rows matching the given condition, Try changing the dates"" 6. One can visualize the temperature readings between 2 different times specified. For example, the following code allows one to visualize the temperature over the last 2 days, (Note, you might observe a failure if there isn’t any data in the last 2 days)from datetime import * import pytz # retrieve the current temperature now = datetime.now(pytz.timezone('UTC')) ''' get the start time that will be behind 2 days from now, just modify ""days = 2"" to ""hours=2"" in case if you want to retrieve the temperature from last 2 hours. 
''' last_n_days = now - timedelta(days=2) plotDF = pandaDF.ix[last_n_days:now] if len(plotDF) > 0: plotDF = plotDF[['temperature','scaledzscore']] fig, ax = plt.subplots() plotDF.plot(figsize=[20,10], ax=ax) # choose the colours for each column with pd.plot_params.use('x_compat', True): plotDF.temperature.plot(color='b') plotDF.scaledzscore.plot(color='r'); ax.set_xlabel(""Timestamp"",fontsize=20) ax.set_ylabel(""Temperature and scaledzscore"",fontsize=20) ax.set_title(""Temperature in last 2 days"", fontsize=20) ax.xaxis.set_minor_locator(dates.AutoDateLocator(tz=None, minticks=1, maxticks=None, interval_multiples=True)) ax.axhline(y=19,c=""purple"",linewidth=2,zorder=0) ax.axhline(y=15,c=""purple"",linewidth=2,zorder=0) ax.set_ylim([13,21]) ax.xaxis.set_minor_formatter(dates.DateFormatter('%dn%an%H:%M:%S')) ax.xaxis.grid(True, which=""minor"") ax.yaxis.grid() ax.xaxis.set_major_locator(dates.MonthLocator()) ax.xaxis.set_major_formatter(dates.DateFormatter('nnn%bn%Y')) plt.tight_layout() plt.show() else: print ""There are no rows matching the given condition, Trying changing the dates"" In this step, we have successfully analyzed the temperature data and visualized the results using bar and line charts. OPERATIONS RELATED TO MAXIMUM TEMPERATURE In this step, we will see how to use the Pandas DataFrames to find the maximum temperature over the hour, day, year and etc.. 1. Enter the following command in the next cell to find out the overall maximum temperature and click Run,# find the maximum temperature maximum = pandaDF.temperature.max() maximum 19.78 2. Enter the following statements in the next cell to find out all the instances where the temperature has crossed 19 degree and click Run . Observe that it returns all the rows where the temperature is greater than 19 degree.threshold_crossed_days = pandaDF[pandaDF.temperature > 19] threshold_crossed_days 3. Enter the following command to return only the days and not the timestamp in which the temperature is crossed the threshold,threshold_crossed_days['timestamp'] = threshold_crossed_days.index days = threshold_crossed_days.timestamp.map(lambda t: t.date()).unique() print ""Number of times the threshold is crossed: "" + str(threshold_crossed_days.temperature.count()) print ""The days are --> "" + str(days)  Number of times the threshold is crossed: 100 The days are –> [datetime.date(2016, 2, 19) datetime.date(2016, 2, 21) datetime.date(2016, 2, 22) datetime.date(2016, 2, 24) …….] 4. Enter the following command to find the hourly maximum temperature for each years, the result will show 24 rows per year wherein each row will show the maximum temperature of the corresponding hour. This will be useful to find out the utilization of the equipment (assuming the temperature is directly propotional to the utilization of the equipment) in each hour, for example, how much the equipment is utilized in the first hour compared to 2nd hour and so on. Best examples could be the space utilization (Office Space, Parking Space and etc..) for each hour over the year.# Find out hourly maximum temperature for each year year_hour_max = pandaDF.groupby(lambda x: (x.year, x.hour)).max() fig, ax = plt.subplots() plotDF = year_hour_max[['temperature']] plotDF.temperature.plot(figsize=(15,5), ax=ax, title='Hourly Maximum temperature for each year') ax.set_xlabel(""Hour of each year"",fontsize=12) ax.set_ylabel(""Temperature"",fontsize=12) plt.show() 5. 
You can create a bar chart as well for better visualization by typing the following command in the next cell and click Run,# draw a bar chart for hourly maximum temperature fig, ax = plt.subplots() plotDF.temperature.plot(kind='bar',figsize=(15,5), ax=ax, title='Hourly Maximum temperature for each year') ax.set_xlabel(""Hour of each year"",fontsize=12) ax.set_ylabel(""Temperature"",fontsize=12) plt.show() 6. But if you want to observe the maximum temperature for each hour (every day) and plot it, enter the following code snippet,# Find out hourly maximum temperature for each day each_hour_max = pandaDF.groupby(lambda x: (x.year, x.month, x.day, x.hour)).max() fig, ax = plt.subplots() plotDF = each_hour_max[['temperature']] plotDF.temperature.plot(figsize=(15,5), ax=ax, title='Hourly Maximum temperature for each Day') ax.set_xlabel(""Hour of each day"",fontsize=12) ax.set_ylabel(""Temperature"",fontsize=12) plt.show()  7. Enter the following command to find out the maximum temperature for each day over the years,# Maximum temperature of each Day df = pandaDF df = df.drop([""_id""], axis=1) df = df.drop([""_rev""], axis=1) df = df.drop([""scaledzscore""], axis=1) df = df.drop([""scaledwzscore""], axis=1) df = df.drop([""forecast""], axis=1) df = df.drop([""zscore""], axis=1) df = df.drop([""wzscore""], axis=1) df['Year'] = map(lambda x: x.year, df.index) df['Month'] = map(lambda x: x.month, df.index) df['Day'] = map(lambda x: x.day, df.index) plotDF = df.groupby(['Day','Month','Year']).max() fig, ax = plt.subplots() plotDF.plot(kind='bar', figsize=[20,10], ax=ax) ax.axhline(y=19,c=""purple"",linewidth=2,zorder=0) ax.axhline(y=15,c=""purple"",linewidth=2,zorder=0) ax.set_ylim([10,25]) ax.set_title(""Daily maximum temperature"", fontsize=20) ax.set_xlabel(""Day"",fontsize=20) ax.set_ylabel(""Temperature"",fontsize=20) ax.xaxis.grid(True, which=""minor"") ax.yaxis.grid() plt.tight_layout() plt.show() In this step, we have seen how to use the Pandas DataFrames to explore and plot the maximum temperature data from the historical data. Similarly you can use the min() function to find the minimum temperatures. OPERATIONS RELATED TO AVERAGE TEMPERATURE In this step, we will see how to use the Pandas DataFrames to explore and plot the average temperature data from the historical data. 1. Enter the following command in the next cell to find out the average temperature and click Run ,#calculate temperature mean pandaDF.temperature.mean() 17.593230723955266 2. Enter the following command to find the average temperature for the last one hour ,from datetime import * import pytz # retrieve the current time now = datetime.now(pytz.timezone('UTC')) last_n_hours = now - timedelta(hours=1) pandaDF.ix[last_n_hours:now].temperature.mean()  17.589529391059482 3. Similarly, to find the average temperature of the last one day enter the following command,# Caculate average temperature for last day from datetime import * import pytz # retrieve the current time now = datetime.now(pytz.timezone('UTC')) last_n_days = now – timedelta(days=1) pandaDF.ix[last_n_days:now].temperature.mean() 17.583351550960117 4. 
Similarly use the following command to find out the average temperature for the last month ,# retrieve the current time now = datetime.now(pytz.timezone('UTC')) ''' get the start time that will be behind n days from now, just modify ""days = n"" to ""hours = n"" in case if you want to retrieve the temperature from last n hours ''' last_n_days = now - timedelta(days=30) pandaDF.ix[last_n_days:now].temperature.mean() 17.592428135954886 5. Enter the following command to find hourly average temperature for each years, the result will show 24 rows per year wherein each column will show the average temperature of the corresponding hour. This will be useful to find out the utilization of the equipment (assuming the temperature is directly propotional to the utilization of the equipment) in each hour, for example, how much the equipment is utilized in the first hour compared to 2nd hour and so on. Best examples could be the space utilization for each hour over the year.# Find out hourly Average temperature for each year year_hour_avg = pandaDF.groupby(lambda x: (x.year, x.hour)).mean() fig, ax = plt.subplots() plotDF = year_hour_avg[['temperature']] plotDF.temperature.plot(figsize=(15,5), ax=ax, title='Hourly Average temperature for each year') ax.set_xlabel(""Hour of each year"",fontsize=12) ax.set_ylabel(""Temperature"",fontsize=12) plt.show()  6. You can create a bar chart as well for better visualization by typing the following command in the next cell and click Run,# draw a bar chart for hourly average temperature fig, ax = plt.subplots() plotDF.temperature.plot(kind='bar',figsize=(15,5), ax=ax, title='Hourly Averagetemperature for each year') ax.set_xlabel(""Hour of each year"",fontsize=12) ax.set_ylabel(""Temperature"",fontsize=12) plt.show()  7. But if you want to find out the average temperature for each hour and plot it, enter the following code snippet, In the following example, we plot the hourly average for last 2 days,# retrieve the current temperature now = datetime.now(pytz.timezone('UTC')) ''' get the start time that will be behind 2 days from now, just modify ""days = 2"" to ""hours=2"" in case if you want to retrieve the temperature from last 2 hours ''' last_n_days = now - timedelta(days=2) plotDF = pandaDF.ix[last_n_days:now] # Find out hourly average temperature for each day plotDF = plotDF.groupby(lambda x: (x.year, x.month, x.day, x.hour)).mean() fig, ax = plt.subplots() plotDF = plotDF[['temperature']] plotDF.temperature.plot(figsize=(15,5), ax=ax, title='Hourly Average temperature of each Day') ax.set_xlabel(""Hour of each day"",fontsize=12) ax.set_ylabel(""Temperature"",fontsize=12) plt.show() 8. 
Enter the following command to find out the average temperature for each day over the years:
# Average temperature of each Day
df = pandaDF
df = df.drop([""_id""], axis=1)
df = df.drop([""_rev""], axis=1)
df = df.drop([""scaledzscore""], axis=1)
df = df.drop([""scaledwzscore""], axis=1)
df = df.drop([""forecast""], axis=1)
df = df.drop([""zscore""], axis=1)
df = df.drop([""wzscore""], axis=1)
df['Year'] = map(lambda x: x.year, df.index)
df['Month'] = map(lambda x: x.month, df.index)
df['Day'] = map(lambda x: x.day, df.index)
plotDF = df.groupby(['Day','Month','Year']).mean()
fig, ax = plt.subplots()
plotDF.plot(kind='bar', figsize=[20,10], ax=ax)
ax.axhline(y=19,c=""purple"",linewidth=2,zorder=0)
ax.axhline(y=15,c=""purple"",linewidth=2,zorder=0)
ax.set_ylim([10,25])
ax.set_title(""Daily Average temperature"", fontsize=20)
ax.set_xlabel(""Day"",fontsize=20)
ax.set_ylabel(""Temperature"",fontsize=20)
ax.xaxis.grid(True, which=""minor"")
ax.yaxis.grid()
plt.tight_layout()
plt.show()
In this step, we have seen how to use Pandas DataFrames to explore and plot the average temperature data from the historical data. CONCLUSION AND THE ROAD AHEAD This recipe showed how to analyze historical timeseries data with Spark SQL and a Jupyter Notebook to understand the temperature trend over the day, month, and year, as well as the maximum and average temperatures over the year. One can use the average/maximum temperature derived from the data analysis to set a rule accordingly in the IBM Real-Time Insights service to create alerts. Developers can take a look at the code made available in this recipe, and at the Notebook in the github repository, to understand what's happening under the hood. The Notebook in github has more operations than what is shown in this recipe. Developers can consider this recipe as a template for doing timeseries historical data analysis and can modify the Python code depending upon the use case. The next recipe will showcase more complex analytical components. Keep watching this space. TUTORIAL TAGS #python bluemix cloudant dataframe ibmiot iot iotf jupyter machine learning pandas spark sql timeseries watson","This recipe showcases how one can analyze the historical time series data, captured on the IBM Watson IoT platform, in a Jupyter Notebook using Spark SQL and Pandas DataFrames. Also, use the pre-installed matplotlib library to visualize results. ",Timeseries Data Analysis of IoT events by using Jupyter Notebook,Live,12 37,"Maureen McElaney, dev advocate at @IBM Watson Data Platform. founder of @GDIBurlington. executive fellow at @BTVIgnite.
content here is mine. Website: http://mcelaney.me/ Apr 24 -------------------------------------------------------------------------------- BRIDGING THE GAP BETWEEN PYTHON AND SCALA JUPYTER NOTEBOOKS USING THE PIXIEDUST PYTHON HELPER LIBRARY TO IMPORT SCALA PACKAGES There’s a reason you’ve been hearing a lot about data science notebooks lately: data scientists are in high demand , and the Python programming language is widely used . In particular, Jupyter Notebooks are a popular tool for creating and sharing code for quick analysis. Most Jupyter Notebooks you’ll see come in two main flavors: Python and Scala. While Python is great for collaborating with colleagues — clean syntax, tons of handy libraries, good documentation — sometimes you need the processing power of Scala. The Apache Spark data processing engine is built on Scala, so if you’re working with a big data set, a Scala Jupyter notebook is the way to go. The downside of Scala is that fewer people know it . HELLO, [SCALA] WORLD! FROM [PYTHON] PIXIEDUST David Taieb has led the charge for our team to build an open source application that we affectionately call PixieDust. The PixieDust Python helper library works as an add-on to your Jupyter notebook that lets you do all sorts of new things , like automatic chart rendering or progress monitors for cells running code. PixieDust can also help developers bridge contexts: call Scala code from a Python Notebook, or call Python code from a Scala Notebook. With this post, I’d like to demonstrate how to use PixieDust to import a Scala “Hello, world!” package into a Jupyter Python notebook. I’m going use the same code that was used in this article by Dustin V : Part 1 : How to add a custom library to a Jupyter Scala notebook in IBM Data Science Experience… I have been using IBM’s Data Science Experience platform for a few months now. Its a great platform to perform data… medium.comI’ll show you how to use the same Scala JAR that Dustin provides, but from within a Jupyter Python notebook instead. PREPARE TO NOTEBOOK! Here are the basic steps: 1. Set up an account in IBM Data Science Experience (DSX) 2. Create a project in DSX 3. Point to Dustin’s Scala JAR 4. Test Scala JAR from Python Notebook with PixieDust 1. SET UP AN ACCOUNT IN IBM DATA SCIENCE EXPERIENCE (DSX) Browse to http://datascience.ibm.com/ and sign up for a free trial. You’ll get a 30-day free trial that includes Jupyter and other tools. (You will need to provide a personal email address for the account). This step should take about 10–15 minutes to complete. 2. CREATE A PROJECT IN DSX When your DSX account is ready, it’s time to create a new project: Name your project (mine is called “pixiedust”), and use the defaults for the Spark and Object Storage instances. You’ll add a notebook in just a moment. 3. POINT TO DUSTIN’S SCALA JAR It’s optional, but you can refer to Dustin V ’s post for steps on how to compile your own JAR file. Otherwise, he makes the JAR available for test purposes, and I found that it works just fine. You’ll see this URL in our sample notebook: https://github.com/dustinvanstee/dv-hw-scala/raw/master/target/scala-2.10/dv-hw-scala-assembly-1.0.jar 4. TEST SCALA JAR FROM PYTHON NOTEBOOK WITH PIXIEDUST You can now use our sample notebook to test PixieDust’s Python-Scala bridge functionality. 
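If you just want a feel for what the bridge looks like before opening the notebook, here is a rough sketch, not taken from the sample notebook itself, of the two PixieDust pieces involved: the installPackage helper that pulls in a JAR, and the %%scala cell magic that runs Scala inside a Python notebook. The JAR URL is Dustin's from above; the Scala class and method in the commented cell are hypothetical placeholders for whatever his package actually exposes, and the double-underscore prefix follows PixieDust's documented convention for returning Scala values to Python.
import pixiedust

# Pull the Scala JAR into the notebook's Spark environment
# (PixieDust asks for a kernel restart after a first-time install).
pixiedust.installPackage('https://github.com/dustinvanstee/dv-hw-scala/raw/master/target/scala-2.10/dv-hw-scala-assembly-1.0.jar')

# In a new cell, the %%scala magic runs Scala against the same Spark context.
# The class and method names below are placeholders, not the real entry point.
#
# %%scala
# val __greeting = com.example.hw.HelloWorld.sayHello()
#
# After the cell runs, __greeting is available back in Python as a plain variable.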
To run the notebook in your own account, first download it via the universal download arrow icon. Download our sample notebook before uploading and running it in your own DSX project. Head back to the DSX project you created in step 2, and add a notebook to your project: choose the From File option. Name your notebook (mine is “HelloWorld”). Now, create your notebook. The notebook has all the configuration and sample code you'll need. Just run the cells using the play icon, and make sure to restart your kernel when prompted in the cell output to avoid errors. Hello Scala - IBM Data Science Experience apsportal.ibm.com: Using PixieDust in a Python Notebook to access a custom Scala Package. You will know that everything is working when you see PixieDust generate a chart at the end of your notebook. Now you know that Scala and Python can be BFFs with PixieDust and Jupyter Notebooks! A Python matplotlib chart generated by PixieDust on a Scala Spark DataFrame. If you enjoyed this article, please ♡ it to recommend it to other Medium readers. Thanks to Mike Broberg. * Data Science * Python * Pixiedust * Scala * Jupyter","There’s a reason you’ve been hearing a lot about data science notebooks lately: data scientists are in high demand, and the Python programming language is widely used. In particular, Jupyter…",Bridging the Gap Between Python and Scala Jupyter Notebooks,Live,13 41,"Raj Singh, Developer Advocate and Open Data Lead at IBM Watson Data Platform. Aug 15. GOT ZIP CODE DATA? PREP IT FOR ANALYTICS. USING FINE-GRAINED U.S. CENSUS DATA AND JUPYTER NOTEBOOKS TO BETTER UNDERSTAND YOUR CUSTOMERS Who are those people lurking behind the statistics in your data? Whether you are looking at retail shoppers, insurance policy holders, banking customers or political constituents, the more you can flesh out the lives of the people behind the numbers, the better you will do at deriving useful insights into how to serve them. This is why demographic market segmentation is such an interesting industry. BLOCK PARTY Market segmentation is the process of dividing a target population into groups, or segments, based on some common characteristics. The strategies for creating these groups range from the simple — age, sex, race, income — to the sophisticated — “Uptown Individuals” or “Cozy Country Living.” Products such as Tapestry Segmentation from Esri or PRIZM from Claritas/Nielsen live at the sophisticated end, and carry a price tag to match. If you are not ready to take the plunge, however, you can do a lot on your own with U.S. Census data, some basic analytics skills, and a Jupyter notebook. The U.S. Census is a treasure trove of free demographic data, as I've written about before. You can find detailed statistics on age, income, race, housing, and occupation from the national level down to the block group (a very small area consisting of about 2,000 people in most places). That's just the tip of the iceberg.
There are many more interesting statistics you can tease out of Census data with a little bit of analytics skill. “Block groups are statistical divisions of census tracts and generally contain between 600 and 3,000 people.” Source: U.S. Census Bureau. THE CORE OF THE PROBLEM Some cities are denser than others. But where are those dense cores so you can finely target them? One statistic I find really interesting is how urban a person is. Do they live in the dense city, the suburbs, or out in the rural countryside? Depending on your question, location can be a more useful fact to know than age or income or family size. You might think that it's pretty easy to figure out what places are city, suburban, and rural, but it turns out to be a bit of a challenge. For example, take the map of eastern Massachusetts below. The City of Boston is shaded in gray in the center of the picture. That's a pretty poor representation of urban, as many towns around Boston are just as urban as the city (Cambridge, Somerville, and others). The Census has a place type called “Urban Areas,” which for the Boston area is the red line you see in the picture. It stretches waaaaaay out from the city to even go into New Hampshire to the north, and almost to Cape Cod to the south. This may make some sense when you look at the country as a whole, comparing Massachusetts to Minnesota for example, but it does a poor job of capturing true urban-ness. The dashed gray line is an even less useful designation from the Census called “Metropolitan Statistical Areas.” Depending on your definition, “urban” can mean a lot of different kinds of places. For instance, Boston's urban core is mostly walkable; however, if you're in Phoenix, you'll need a car. Now look at the map below derived from the data I've prepared. Instead of using the most detailed level of Census data — block groups — I use zip codes because you'll always have a zip code for your customers. Data geek note: these are actually “zip code tabulation areas” (ZCTAs), not true zip codes. ZCTAs are a zip code-esque structure the Census created to make zip code data better for mapping and spatial analysis. It shows most of Boston, and some neighboring zips, in red — true urban areas, places where people live primarily in multi-family housing, condos, or apartments. Toward the south, you can also see little red spots in Providence, RI; New Bedford, MA; and Fall River, MA. The orange color depicts areas called “Early Suburban.” Here you'll find people living primarily in single-family homes, but lot sizes will usually be around a 1/4 to 1/2 acre. Then in light orange, you'll see areas that are closer to rural, with single-family homes on 1 acre lots or larger. Finally, in a light tan color, is everything else — truly rural areas consisting primarily of 1+ acre residential lots, farms, and forests. Picking the core urban areas out of a wider, more suburban metro area. METHODOLOGY: BEFORE AND AFTER THE CAR The methodology used to build this model comes from an academic article, “From Jurisdictional to Functional Analysis of Urban Cores & Suburbs” in New Geography. From that work, my notebook uses the following classifications for urban-ness (a minimal pandas sketch of this classification logic follows the list):
* Urban (pre-auto urban core): density > 2,900 per sq. km
* Auto suburban, early: median house built 1946 to 1979, density between 100 and 2,900 per sq. km
* Auto suburban, later: median house built after 1979, density between 100 and 2,900 per sq. km
* Auto exurban: all others
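To make those thresholds concrete, here is a minimal pandas sketch of the classification step. It is not code from urbanity.ipynb: the column names pop_density_sqkm and median_year_built are assumed stand-ins for whatever the notebook derives from the Census files, and the cutoffs are exactly the ones listed above.
import numpy as np
import pandas as pd

def classify_urbanity(df):
    # One row per ZCTA, with assumed columns pop_density_sqkm and median_year_built.
    dense = df['pop_density_sqkm'] > 2900
    suburban = (df['pop_density_sqkm'] > 100) & (df['pop_density_sqkm'] <= 2900)
    early = df['median_year_built'].between(1946, 1979)
    later = df['median_year_built'] > 1979
    conditions = [dense, suburban & early, suburban & later]
    labels = ['urban', 'auto_suburban_early', 'auto_suburban_later']
    # Anything that matches none of the conditions falls through to exurban.
    return pd.Series(np.select(conditions, labels, default='auto_exurban'), index=df.index)

# Usage on the joined DataFrame described in the next section:
# zcta_df['urbanity'] = classify_urbanity(zcta_df)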
From the requirements above, the key data needed to reproduce the model are population and the median age-of-home in an area. We can easily get these data from the U.S. Census American Community Survey. The instructions for doing this yourself, if you are so inclined, are in the Jupyter notebook referenced below. SHOW ME THE DATA If you are less interested in the details of the analysis, and just want the data to use in your own work, we've provided a public download of the CSV file in this GitHub repo. If you want to see the details of how it was built, read on. OH, THE URBANITY! I analyzed the data using Python in a Jupyter Notebook called urbanity.ipynb in the same GitHub repo. It uses the Pandas read_csv function to extract statistics on zip code areas, population counts, and median housing age from three larger data files. In the notebook, I then join those statistics into a single DataFrame and calculate population density per square kilometer. From there it's a simple matter of running some SQL-like queries on the DataFrame to classify the zip codes into the four categories of interest. That's it for the initial analysis. LOOKING AROUND THE U.S. The Jupyter notebook goes on to create an interactive map using Mapbox technology, which I'll describe in detail in a forthcoming post. For now, I want to focus on what this map can tell us. As with the Boston example, other views from around the country each tell different stories about the composition of urban-ness, which, when combined with your own data, can lead to deeper insights into customers or constituents. The dense Mid-Atlantic region from New York City to Baltimore. Contrastingly, urbanity in the South shows almost no dense urban areas. Combining both extremes, Los Angeles to the San Francisco Bay shows large swaths of rural areas. If you find the data useful, or want to know more about how to use it to build a custom analysis, please leave a comment here. Whether you're in a Pre-Auto Urban Core or an Auto Exurban municipality, thank you for reading! Please ♡ this article to recommend it to other Medium readers. Thanks to Mike Broberg. * Jupyter * Analytics * Data Science * Mapbox * Market Segmentation","Who are those people lurking behind the statistics in your data? Whether you are looking at retail shoppers, insurance policy holders, banking customers or political constituents, the more you can…",Got zip code data? Prep it for analytics. – IBM Watson Data Lab – Medium,Live,14 45,"SPARK.TC STREAMING EXTEND STRUCTURED STREAMING FOR SPARK ML EARLY METHODS TO INTEGRATE MACHINE LEARNING USING NAIVE BAYES AND CUSTOM SINKS.
To learn more about Structured Streaming and Machine Learning, check out Holden Karau’s and Seth Hendrickson’s session Spark Structured Streaming for machine learning at Strata + Hadoop World New York from 2:05pm to 2:45pm, Thursday September 29th. Spark’s new ALPHA Structured Streaming API has caused a lot of excitement because it brings the Data set/DataFrame/SQL APIs into a streaming context. In this initial version of Structured Streaming, the machine learning APIs have not yet been integrated. However, this doesn’t stop us from having fun exploring how to get machine learning to work with Structured Streaming. (Simply keep in mind this is exploratory, and things will change in future versions.) For our Spark Structured Streaming for machine learning talk on at Strata + Hadoop World New York 2016, we’ve started early proof-of-concept work to integrate structured streaming and machine learning available in the spark-structured-streaming-ml repo. If you are interested in following along with the progress toward Spark's ML pipelines supporting structured streaming, I encourage you to follow SPARK-16424 and give us your feedback on our early draft design document . One of the simplest streaming machine learning algorithms you can implement on top of structured streaming is Naive Bayes, since much of the computation can be simplified to grouping and aggregating. The challenge is how to collect the aggregate data in such a way that you can use it to make predictions. The approach taken in the current streaming Naive Bayes won’t directly work, as the ForeachSink available in Spark Structured Streaming executes the actions on the workers, so you can’t update a local data structure with the latest counts. Instead, Spark's Structured Streaming has an in-memory table output format you can use to store the aggregate counts. // Compute the counts using a Dataset transformation val counts = ds.flatMap{ case LabeledPoint(label, vec) => vec.toArray.zip(Stream from 1).map(value => LabeledToken(label, value)) }.groupBy($""label"", $""value"").agg(count($""value"").alias(""count"")) .as[LabeledTokenCounts] // Create a table name to store the output in val tblName = ""qbsnb"" + java.util.UUID.randomUUID.toString.filter(_ != '-').toString // Write out the aggregate result in complete form to the in memory table val query = counts.writeStream.outputMode(OutputMode.Complete()) .format(""memory"").queryName(tblName).start() val tbl = ds.sparkSession.table(tblName).as[LabeledTokenCounts] The initial approach taken with Naive Bayes is not easily generalizable to other algorithms, which cannot as easily be represented by aggregate operations on a Dataset . Looking back at how the early DStream-based Spark Streaming API implemented machine learning can provide some hints on one possible solution. Provided you can come up with an update mechanism on how to merge new data into your existing model, the DStream foreachRDD solution allows you to access the underlying micro-batch view of the data. Sadly, foreachRDD doesn't have a direct equivalent in Structured Streaming, but by using a custom sink, you can get similar behavior in Structured Streaming. The sink API is defined by StreamSinkProvider , which is used to create an instance of the Sink given a SQLContext and settings about the sink, and Sink trait, which is used to process the actual data on a batch basis. 
abstract class ForeachDatasetSinkProvider extends StreamSinkProvider {
  def func(df: DataFrame): Unit
  def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): ForeachDatasetSink = {
    new ForeachDatasetSink(func)
  }
}
case class ForeachDatasetSink(func: DataFrame => Unit) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    func(data)
  }
}
As with writing DataFrames to custom formats, to use a third-party sink you can specify the full class name of the sink. Since you need to specify the full class name of the format, you need to ensure that any instance of the SinkProvider can update the model—and since you can't get access to the sink object that gets constructed—you need to make the model outside of the sink.
object SimpleStreamingNaiveBayes {
  val model = new StreamingNaiveBayes()
}
class StreamingNaiveBayesSinkProvider extends ForeachDatasetSinkProvider {
  override def func(df: DataFrame) {
    val spark = df.sparkSession
    SimpleStreamingNaiveBayes.model.update(df)
  }
}
You can use the custom sink shown above to integrate machine learning into Structured Streaming while you are waiting for Spark ML to be updated with Structured Streaming.
// Train using the model inside SimpleStreamingNaiveBayes object
// - if called on multiple streams all streams will update the same model :(
// or would except if not for the hard coded query name preventing multiple
// of the same running.
def train(ds: Dataset[_]) = {
  ds.writeStream.format(
    ""com.highperformancespark.examples.structuredstreaming."" +
    ""StreamingNaiveBayesSinkProvider"")
    .queryName(""trainingnaiveBayes"")
    .start()
}
If you are willing to throw caution to the wind, you can access some Spark internals to construct a sink that behaves more like the original foreachRDD. If you are interested in custom sink support, you can follow SPARK-16407 or this PR. The cool part is, regardless of whether you want to access the internal Spark APIs, you can now handle batch updates in the same way Spark's earlier streaming machine learning is implemented. While this certainly isn't ready for production usage, you can see that the Structured Streaming API offers a number of different ways it can be extended to support machine learning. You can learn more in High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark. HOLDEN KARAU, 22 September 2016. TAGS: streaming, data pros.
* * * *",Early methods to integrate machine learning using Naive Bayes and custom sinks.,Apache Spark™ 2.0: Extend Structured Streaming for Spark ML,Live,15 48,"* Home * Research * Partnerships and Chairs * Staff * Books * Articles * Videos * Presentations * Contact Information * Subscribe to our Newsletter * 中文 * Marketing Analytics * Credit Risk Analytics * Fraud Analytics * Process Analytics * Human Resource Analytics * Prof. dr. Bart Baesens * Prof. dr. Seppe vanden Broucke * Aimée Backiel * Libo Li * Sandra Mitrović * Klaas Nelissen * María Óskarsdóttir * Michael Reusens * Eugen Stripling * Tine Van Calster * Basic Java Programming * Principles of Database Management * Business Information Systems * Mini Lecture Series * Other Videos HIGHER-ORDER LOGISTIC REGRESSION FOR LARGE DATASETS Posted on February 11, 2017Contributed by: Sandra Mitrović This article first appeared in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to receive our feature articles, or follow us @DataMiningApps . Do you also wish to contribute to Data Science Briefings? Shoot us an e-mail over at briefings@dataminingapps.com and let’s get in touch! -------------------------------------------------------------------------------- The performance of supervised predictive models is characterized by the generalization error, that is, the error, obtained on datasets different from the one used to train the model. More precisely, generalization error equals to expectation of predictive error over all datasets D and ground truth y and can be represented as: E D,y [ L ( y ,f( x ))], where f( x ) is predicted outcome for an input x and L is the chosen loss function. Obviously, the goal is to minimize generalization error, which, for several most typical choices of loss function, can be decomposed as: a L * Variance( x ) + Bias( x ) + b L * Noise( x ) Where factors a L and b L depend on the choice of a loss function L [1]. Bias measures systematic error, which stems from the predictive method used. Variance , on the other hand, is not related to the chosen modeling method, but rather to the dataset used. It represents fluctuations around the most commonly (in case of classification)/average (in case of regression) predicted value in different test datasets. Ideally, both bias and variance should be low, since high bias leads to under-fitting, which is inability of the model to fit the data (low predictive performance on train data), while high variance leads to over-fitting meaning that the obtained model is too much adjusted to the train data that it fails to generalize on test data (also known as “memorization” (of train data)). This, however, is not possible and hence, in practice, we have to make a trade-off between variance and bias. Different types of models have different bias/variance profiles, e.g. Naïve Bayes classifier has low variance and high bias, while decision tree has low bias and high variance. Logistic Regression (LR) is a well-established method, which despite being fairly simple has been proven to have good performances [2]. On one side, this is beneficial since it facilitates interpretation of the model and obtained results. On the other side, having low variance (and high bias), makes it a limited method in terms of its expressive power. We can overcome this drawback by introducing more complex features, obtained as a Cartesian product of the original features. 
Logistic regression of order n (denoted LR_n) is defined as the logistic regression modeling the interactions of the n-th order (as defined in [3], although it can also be defined to consider lower-level interactions, i.e. interactions of order ≤ n). LR_n allows modeling of a much larger number of distributions compared to LR. Obviously, on smaller datasets this leads to over-fitting, and different types of regularization are known in the literature to penalize for the model complexity. But what happens in the case of really large datasets? It has been demonstrated that, as the training dataset grows, variance decreases and bias increases [4]. Hence, the high variance of LR_n would not be a problem, as long as the bias could be controlled. The bias/variance profile of higher-order LR has been extensively investigated in [3], where LR_n (for n = 1, 2, 3) was compared on 75 datasets from the UCI repository. As can be seen in Figure 1 (borrowed from [3]), with an increasing amount of data, higher-order LR performs better than lower-order LR. In other words, with large enough datasets, as the order n increases, the bias of LR_n decreases. This clearly motivates the usage of higher-order LR with large datasets. Once again, it is very important to emphasize the amount of data observed (i.e. the number of instances). For example, based on the zoomed part of the graph, if we made a strict cutoff at any number of instances i, where i < 1000, then, because the LR_1 learning curve decreases more steeply than both LR_2 and LR_3, we would conclude that LR_1 performs the best (out of these three). As can be seen from the upper part of the figure, the same conclusion could be drawn to the detriment of LR_3 after a sufficiently large number of instances. However, for extremely large datasets, it is obvious that LR_3 outperforms the other two. This phenomenon is due to the fact that lower-order logistic regressions have a higher learning rate at the beginning of the learning process.
Figure 1: Learning curves of logistic regressions of different order, plotted for increasing amounts of data (an illustration from [3]).
REFERENCES
* [1] Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of the 17th International Conference on Machine Learning (pp. 231-238).
* [2] Baesens, B., Van Gestel, T., Viaene, S., Stepanova, M., Suykens, J., & Vanthienen, J. (2003). Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society, 54(6), 627-635.
* [3] Zaidi, N. A., Webb, G. I., Carman, M. J., & Petitjean, F. (2016). ALRn: Accelerated Higher-Order Logistic Regression. In Proceedings of the European Conference on Machine Learning.
* [4] Brain, D., & Webb, G. I. (2002). The need for low bias algorithms in classification learning from large data sets. In European Conference on Principles of Data Mining and Knowledge Discovery (pp. 62-73). Springer Berlin Heidelberg.
","The performance of supervised predictive models is characterized by the generalization error, that is, the error obtained on datasets different from the one used to train the model.",Higher-order Logistic Regression for Large Datasets,Live,16 50,"COMPOSE FOR MYSQL NOW FOR YOU Published Oct 12, 2016 Compose is pleased to bring a new database onto our platform in the form of Compose for MySQL. We've always considered MySQL as a potential Compose database, but had to wait for the arrival of a high availability solution that worked well enough that we could deliver it to Compose users. That solution came in the form of MySQL InnoDB Cluster, a new version of MySQL which has a reliable, performant replication and high availability architecture that is an excellent fit for the Compose environment. That meant we could bring MySQL's feature set to Compose and offer its proven and popular capabilities to our users. MySQL InnoDB Cluster is built around MySQL 5.7.15, which allows us to offer many of the most recent innovations such as the MySQL shell, the X DevAPI and the JSON document store as part of our new MySQL deployments. Adopting this leading-edge version of MySQL for our Compose for MySQL beta brings all the latest benefits of MySQL to our users. It also means that MySQL users can enjoy the power of Compose to set up their database with just one click, enjoy regular, automated backups and sleep better knowing their database is highly available in whichever cloud platform they choose. It means a database you can administer from the web, through an easy-to-use web front end. Give your administrators and developers their own accounts, with easily created roles to control access to your new database too. These are just some of the benefits of Compose for MySQL. So, how do you get going with the beta of Compose for MySQL? Simply log in to your Compose account, select Create Deployment and select MySQL from the Beta list. Within minutes you'll have your own MySQL cluster up and running. It'll cost you $27 a month for your first GB of data and $18 a month for each additional GB. If you don't have a Compose account, sign up now for a 30 day free trial.
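For readers who want to try the beta from code as well as from the web console, here is a minimal, hypothetical sketch of connecting to a Compose for MySQL deployment from Python with PyMySQL and exercising the MySQL 5.7 JSON column type mentioned above. The host, port, and credentials are placeholders you would copy from your deployment's connection details, and PyMySQL is just one of several client libraries you could use.

# Hypothetical connection details copied from a Compose for MySQL deployment.
import json
import pymysql

conn = pymysql.connect(
    host='sl-example.dblayer.com',  # placeholder host
    port=10000,                     # placeholder port
    user='admin',
    password='secret',
    database='compose',
)

with conn.cursor() as cur:
    # MySQL 5.7 ships a native JSON column type, part of the JSON document store.
    cur.execute('CREATE TABLE IF NOT EXISTS docs (id INT AUTO_INCREMENT PRIMARY KEY, doc JSON)')
    cur.execute('INSERT INTO docs (doc) VALUES (%s)', (json.dumps({'hello': 'compose'}),))
    conn.commit()
    cur.execute('SELECT id, doc FROM docs')
    for row in cur.fetchall():
        print(row)

conn.close()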
","We've always considered MySQL as a potential Compose database, but had to wait for the arrival of a high availability solution that worked well enough that we could deliver it to Compose users. That solution came in the form of MySQL InnoDB Cluster.",Compose for MySQL now for you,Live,17 57,"Luke de Oliveira | Feb 11 -------------------------------------------------------------------------------- FUELING THE GOLD RUSH: THE GREATEST PUBLIC DATASETS FOR AI It has never been easier to build AI or machine learning-based systems than it is today. The ubiquity of cutting-edge open-source tools such as TensorFlow, Torch, and Spark, coupled with the availability of massive amounts of computation power through AWS, Google Cloud, or other cloud providers, means that you can train cutting-edge models from your laptop over an afternoon coffee. Though not at the forefront of the AI hype train, the unsung hero of the AI revolution is data — lots and lots of labeled and annotated data, curated with the elbow grease of great research groups and companies who recognize that the democratization of data is a necessary step towards accelerating AI. However, most products involving machine learning or AI rely heavily on proprietary datasets that are often not released, as this provides implicit defensibility. With that said, it can be hard to piece through which public datasets are useful to look at, which are viable for a proof of concept, and which datasets can be useful as a potential product or feature validation step before you collect your own proprietary data. It's important to remember that good performance on a dataset doesn't guarantee a machine learning system will perform well in real product scenarios. Most people in AI forget that the hardest part of building a new AI solution or product is not the AI or algorithms — it's the data collection and labeling. Standard datasets can be used as validation or a good starting point for building a more tailored solution. This week, a few machine learning experts and I were talking about all this. To make your life easier, we've collected an (opinionated) list of some open datasets that you can't afford not to know about in the AI world. -------------------------------------------------------------------------------- Legend: 📜 Classic — these are some of the more famous, legacy, or storied datasets in AI. It's hard to find a researcher or engineer who hasn't heard of them. 🛠 Useful — these are datasets that are about as close to real-world as a curated, cleaned dataset can be. Also, these are often general enough to be useful in both the product and R&D worlds. 📚 Academic baseline — these are datasets that are commonly used in the academic side of Machine Learning and AI as benchmarks or baselines.
For better or worse, people use these datasets to validate algorithms. 🗿 Old - these datasets, irrespective of utility, have been around for a while. COMPUTER VISION * 📚 📜 🗿 MNIST : most commonly used sanity check. Dataset of 25x25, centered, B&W handwritten digits. It is an easy task — just because something works on MNIST, doesn’t mean it works. * 📜 🗿 CIFAR 10 & CIFAR 100 : 32x32 color images. Not commonly used anymore, though once again, can be an interesting sanity check. * 🛠 📚 📜 ImageNet : the de-facto image dataset for new algorithms. Many image API companies have labels from their REST interfaces that are suspiciously close to the 1000 category WordNet hierarchy from ImageNet. * LSUN : Scene understanding with many ancillary tasks (room layout estimation, saliency prediction, etc.) and an associated competition. * 📚 PASCAL VOC : Generic image Segmentation / classification — not terribly useful for building real-world image annotation, but great for baselines. * 📚 SVHN : House numbers from Google Street View. Think of this as recurrent MNIST in the wild. * MS COCO : Generic image understanding / captioning, with an associated competition. * 🛠 Visual Genome : Very detailed visual knowledge base with deep captioning of ~100K images. * 🛠 📚 📜 🗿 Labeled Faces in the Wild : Cropped facial regions (using Viola-Jones ) that have been labeled with a name identifier. A subset of the people present have two images in the dataset — it’s quite common for people to train facial matching systems here. NATURAL LANGUAGE * 🛠 📚 Text Classification Datasets (Google Drive Link) from Zhang et al., 2015 : An extensive set of eight datasets for text classification. These are the most commonly reported baselines for new text classification baselines. Sample size of 120K to 3.6M, ranging from binary to 14 class problems. Datasets from DBPedia, Amazon, Yelp, Yahoo!, Sogou, and AG. * 🛠 📚 WikiText : large language modeling corpus from quality Wikipedia articles, curated by Salesforce MetaMind . * 🛠 Question Pairs : first dataset release from Quora containing duplicate / semantic similarity labels. * 🛠 📚 SQuAD : The Stanford Question Answering Dataset — broadly useful question answering and reading comprehension dataset, where every answer to a question is posed as a span , or segment of text. * CMU Q/A Dataset : Manually-generated factoid question/answer pairs with difficulty ratings from Wikipedia articles. * 🛠 Maluuba Datasets : Sophisticated, human-generated datasets for stateful natural language understanding research. * 🛠 📚 Billion Words : large, general purpose language modeling dataset. Often used to train distributed word representations such as word2vec or GloVe . * 🛠 📚 Common Crawl : Petabyte-scale crawl of the web — most frequently used for learning word embeddings. Available for free from Amazon S3 . Can also be useful as a network dataset for it’s crawl of the WWW. * 📚 📜 bAbi : synthetic reading comprehension and question answering dataset from Facebook AI Research (FAIR) . * 📚 The Children’s Book Test ( download link ): Baseline of (Question + context, Answer) pairs extracted from Children’s books available through Project Gutenberg. Useful for question-answering, reading comprehension, and factoid look-up. * 📚 📜 🗿 Stanford Sentiment Treebank : standard sentiment dataset with fine-grained sentiment annotations at every node of each sentence’s parse tree. 
* 📜 🗿 20 Newsgroups : one of the classic datasets for text classification, usually useful as a benchmark for either pure classification or as a validation of any IR / indexing algorithm. * 📜 🗿 Reuters : older, purely classification based dataset with text from the newswire. Commonly used in tutorials. * 📜 🗿 IMDB : an older, relatively small dataset for binary sentiment classification. Fallen out of favor for benchmarks in the literature in lieu of larger datasets. * 📜 🗿 UCI’s Spambase : Older, classic spam email dataset from the famous UCI Machine Learning Repository . Due to details of how the dataset was curated, this can be an interesting baseline for learning personalized spam filtering. SPEECH Most speech recognition datasets are proprietary — the data holds a lot of value for the company that curates. Most datasets available in the field are quite old. * 📚 🗿 2000 HUB5 English : English-only speech data used most recently in the Deep Speech paper from Baidu. * 📚 LibriSpeech : Audio books data set of text and speech. Nearly 500 hours of clean speech of various audio books read by multiple speakers, organized by chapters of the book containing both the text and the speech. * 🛠 📚 VoxForge : Clean speech dataset of accented english, useful for instances in which you expect to need robustness to different accents or intonations. * 📚 📜 🗿 TIMIT : English-only speech recognition dataset. * 🛠 CHIME : Noisy speech recognition challenge dataset. Dataset contains real, simulated and clean voice recordings. Real being actual recordings of 4 speakers in nearly 9000 recordings over 4 noisy locations, simulated is generated by combining multiple environments over speech utterances and clean being non-noisy recordings. * TED-LIUM : Audio transcription of TED talks. 1495 TED talks audio recordings along with full text transcriptions of those recordings. RECOMMENDATION AND RANKING SYSTEMS * 📜 🗿 Netflix Challenge : first major Kaggle style data challenge. Only available unofficially, as privacy issues arose . * 🛠 📚 📜 MovieLens : various sizes of movie review data — commonly used for collaborative filtering baselines. * Million Song Dataset : large, metadata-rich, open source dataset on Kaggle that can be good for people experimenting with hybrid recommendation systems. * 🛠 Last.fm : music recommendation dataset with access to underlying social network and other metadata that can be useful for hybrid systems. NETWORKS AND GRAPHS * 📚 Amazon Co-Purchasing and Amazon Reviews : crawled data from the “ users who bought this also bought… ” section of Amazon, as well as amazon review data for related products. Good for experimenting with recommendation systems in networks. * Friendster Social Network Dataset : Before their pivot as a gaming website, Friendster released anonymized data in the form of friends lists for 103,750,348 users. GEOSPATIAL DATA * 🛠 📜 OpenStreetMap : Vector data for the entire planet under a free license . It includes (an older version of) the US Census Bureau’s TIGER data. * 🛠 Landsat8 : Satellite shots of the entire Earth surface, updated every several weeks. * 🛠 NEXRAD : Doppler radar scans of atmospheric conditions in the US. ❗️People often think solving a problem on one dataset is equivalent to having a well thought out product. Use these datasets as validation or proofs of concept , but don’t forget to test or prototype how the product will function and obtain new, more realistic data to improve its operation. 
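As a concrete example of the 'validation or proof of concept' use suggested above, here is a minimal sketch that pulls one of the classic datasets listed earlier (20 Newsgroups, via scikit-learn's built-in loader) and fits a quick text-classification baseline. It is purely illustrative; the model choice and parameters are assumptions, and a real product would still need its own data.

# Quick proof-of-concept baseline on a public dataset (20 Newsgroups).
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

# TF-IDF features plus logistic regression: a common, cheap text baseline.
baseline = make_pipeline(
    TfidfVectorizer(max_features=50000),
    LogisticRegression(max_iter=1000),
)
baseline.fit(train.data, train.target)

print('held-out accuracy:', baseline.score(test.data, test.target))

A number like this only tells you the approach is plausible on curated data; it says nothing about how the system behaves on the messier data a real product will see, which is exactly the caveat above.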
Successful data-driven companies usually derive strength from their ability to collect new, proprietary data that improves their performance in a defensible way. -------------------------------------------------------------------------------- PLEASE CONTRIBUTE! If you think we’ve missed a dataset or two (which we definitely have!) or have a conflicting opinion about a dataset discussed here, please let me know with a comment, or you can shoot me an email at lukedeo@ldo.io ! P.S. — This post is part of a open, collaborative effort to build an online reference, the Open Guide to Practical AI , which we’ll release in draft form in a few weeks. See this popular previous guide for an example. If you’d like to get updates on or help with with this effort, drop me a comment or email me at lukedeo@ldo.io . Special thanks to Joshua Levy , Srinath Sridhar , and Max Grigorev . Thanks to Joshua Levy . Machine Learning Artificial Intelligence Deep Learning Data Science Big Data 759 18 Blocked Unblock Follow FollowingLUKE DE OLIVEIRA Deep learning, Infrastructure, and Open Source. Founder @ Vai, Visiting scientist @ Berkeley Labs, Stanford/Yale Alum. FollowSTARTUP GRIND The life, work, and tactics of entrepreneurs around the world — by founders, for founders. Welcoming submissions on technology trends, product design, growth strategies, and venture investing. * Share * 759 * * Never miss a story from Startup Grind , when you sign up for Medium. Learn more Never miss a story from Startup Grind Get updates Get updates","It has never been easier to build AI or machine learning-based systems than it is today. The ubiquity of cutting edge open-source tools such as TensorFlow, Torch, and Spark, coupled with the…",The Greatest Public Datasets for AI – Startup Grind,Live,18 60,"METRICS MAVEN: MODE D'EMPLOI - FINDING THE MODE IN POSTGRESQL Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published Dec 6, 2016In our Metrics Maven series, Compose's data scientist shares database features, tips, tricks, and code you can use to get the metrics you need from your data. In this article, we'll have a look at mode to round out our series on mean, median, and mode. Mode is the simplest to understand of the three metrics we've been looking at (mean, median, and mode) so we'll keep this article short and sweet and get straight to it. If you want to start with a review of mean or median before looking at mode, then have a look at Calculating a Mean or A Look at Median for a refresher. For our examples in this article, we'll continue to use the orders data from our dog products catalog that we've used in the previous articles: order_id | date | item_count | order_value ------------------------------------------------ 50000 | 2016-09-02 | 3 | 35.97 50001 | 2016-09-02 | 2 | 7.98 50002 | 2016-09-02 | 1 | 5.99 50003 | 2016-09-02 | 1 | 4.99 50004 | 2016-09-02 | 7 | 78.93 50005 | 2016-09-02 | 0 | (NULL) 50006 | 2016-09-02 | 1 | 5.99 50007 | 2016-09-02 | 2 | 19.98 50008 | 2016-09-02 | 1 | 5.99 50009 | 2016-09-02 | 2 | 12.98 50010 | 2016-09-02 | 1 | 20.99 MODE The mode of a series is the most frequently occurring value. In some series this may indicate popularity. In others, it is an indication of commonality, more conspicuous than the average or the median. For our use case, together with mean and median , the mode can help us really zero in on why we're seeing the results that we do from our hypothetical pet supply business. 
Unlike median, for which we covered 4 different query options in our previous article, PostgreSQL offers a built-in function, starting in version 9.4, to find the mode in a series: MODE() . Let's dive right into some examples. We'll start by finding the mode for item_count with this query: SELECT MODE() WITHIN GROUP (ORDER BY item_count) AS item_count_mode FROM orders; As you can see, the syntax for MODE() looks a little awkward. You use the WITHIN GROUP (ORDER BY ...) clause to indicate the field you want to get the mode of. We encountered this clause when finding the median in option 4 of our previous article. This clause is used with the ordered set aggregates introduced in PostgreSQL 9.4, such as PERCENTILE_CONT and RANK . Once you start to use these aggregate functions, you'll easily get the hang of it. Now back to what we were doing... Our result from the query above is 1. Orders from our dog products catalog contain only 1 item most frequently. That's disappointing for the business. Secretly, we'd hoped customers would buy whole product lines of items for their pooches! ZEROES AND NULLS You may be wondering right about now how MODE() handles zeroes and NULLs since one of our orders has a ""0"" item count and a NULL order value. From our previous articles, we know that this is an important aspect to consider for obtaining the best metrics for the use case. MODE() and the other ordered set aggregates ignore NULL values by default. That's good news because we determined previously that we should be ignoring orders that have a 0 item_count or a NULL order_value . Those would clearly be invalid orders. MODE() does not, however, ignore zeroes. In our case, it does not matter much since we have only one zero value in our orders, but if we didn't know that, we would actually want to write the query including a WHERE condition for the item_count to not be zero, like so: SELECT MODE() WITHIN GROUP (ORDER BY item_count) AS item_count_mode FROM orders WHERE item_count <> 0; WHAT MODE CAN TELL US ABOUT OUR BUSINESS Now that we've got the handling of zeroes and NULLs squared away, let's look at the mode for order_value to get more insight into orders: SELECT MODE() WITHIN GROUP (ORDER BY order_value) AS order_value_mode FROM orders; The result we get back is $5.99. Hmmmm.... these are pretty strong indicators for why our business isn't performing as well as we want it to be. Customers are most frequently only purchasing 1 item at a time with a value of $5.99. If we look back at the values we got from mean and median for each of these fields, the story becomes clearer with each metric: * Mean item count = 2.10 * Median item count = 1.5 * Mode item count = 1 * Mean order value = $19.98 * Median order value = $10.48 * Mode order value = $5.99 If we were relying on just the mean (or even the median) to get a sense of our business performance, we would have inadvertently believed we were doing much better than we actually are. Now, in full recognition of the reality that our orders are not where we want them to be, we can take action. We might offer a discount for customers who purchase multiple items in one order or we might promote higher-priced items more strongly than lower-priced ones. Armed with these metrics, we can decide how to increase orders and improve our business. WRAPPING UP This concludes our look at mean, median, and mode and why each of them is an important metric to get a handle on. As we've seen, they each provide a slightly different perspective on the data.
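If you want to pull all three of these metrics from application code rather than a psql session, here is a minimal sketch using Python and the psycopg2 driver; the driver choice and connection string are assumptions, and the query simply combines AVG() , PERCENTILE_CONT(0.5) and MODE() as described in this series.

# Pull mean, median, and mode for order_value in one round trip.
# Assumes a local 'metrics' database containing the orders table from this series.
import psycopg2

QUERY = '''
    SELECT AVG(order_value)                                         AS mean_value,
           PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY order_value) AS median_value,
           MODE()               WITHIN GROUP (ORDER BY order_value) AS mode_value
    FROM orders
    WHERE item_count <> 0;
'''

conn = psycopg2.connect('dbname=metrics user=postgres')  # assumed local connection details
with conn, conn.cursor() as cur:
    cur.execute(QUERY)
    mean_value, median_value, mode_value = cur.fetchone()
    print('mean:', mean_value, 'median:', median_value, 'mode:', mode_value)
conn.close()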
By using all of them together, we can do a much better job of understanding how our business is doing (and then determining the actions we should take) than by using just one of them alone. This also concludes 2016 for the Metrics Maven series! Join us next year as we go even deeper into metrics - how to calculate and apply them to get the most from your data. Until then, wishing you all happy holidays! Image by: Peggy_Marco Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Lisa Smith - keepin' it simple. Love this article? Head over to Lisa Smith’s author page and keep reading. Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Customer Stories Compose Webinars Support System Status Support Documentation Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ ScyllaDB MySQL Enterprise Add-ons * Deployments AWS SoftLayer Google Cloud * Subscribe Join 75,000 other devs on our weekly database newsletter. © 2016 Compose","In our Metrics Maven series, Compose's data scientist shares database features, tips, tricks, and code you can use to get the metrics you need from your data. In this article, we'll have a look at mode to round out our series on mean, median, and mode.",Finding the Mode in PostgreSQL,Live,19 62,"Homepage Follow Sign in / Sign up Homepage * Home * Data Science Experience * * Watson Data Platform * IBM Data Science Experience Blocked Unblock Follow Following Apr 25 -------------------------------------------------------------------------------- WORKING INTERACTIVELY WITH RSTUDIO AND NOTEBOOKS IN DSX It is often useful to use RStudio for one piece of your analysis and notebooks (whether in R, or in another language) for other parts of your analysis. This article will step you through the process of interacting between the two environments by saving data to the underlying Spark service’s distributed GPFS file structure. The following is top level view of the cluster stack for Data Science Experience (DSX): Notice that the RStudio kernel is not a typical “edge node” and instead accesses DSX via the sparklyr snappy connect pipeline. For instructions on how to connect an RStudio instance to a Spark instance follow the examples provided in the /home/rstudio/ibm-sparkaas-demo folder from your DSX home: As the code shows, you will need to know the names of your spark service kernels: > kernels <- list_spark_kernels() > kernels [1] ""test1"" ""Apache Spark-ty"" ""sparklyr"" ""Apache Spark-jw"" For my RStudio session, I have four spark kernels available (“test1”, “Apache Spark-ty”, “sparklyr”, “Apache Spark-jw”). To get R to interact with a Spark service choose one of the services listed: > # connect to Spark kernel > sc <- spark_connect(config = ""sparklyr"") For more information on how to use the sparklyr connection to your spark instance see the impressive sparklyr documentation at spark.rstudio.com . In order to interact between notebooks and RStudio you will need a notebook running on the same Spark session as you connected to with sparklyr: Notice that above, I connected RStudio to the SparkAAS named “sparklyr”, and here I am creating my R notebook on the Spark Service named the same. This way they will be able to interact on that service. Notice that we also could interact with the Spark objects with Python or Scala notebooks. To see these two interfaces working together, let’s save some spark data to the distributed file system. 
First, we will need to find the address to the Spark service GPFS home directory. Here we are seeing the address to the tenant root directory plus the default /notebook/work : The R code used in the Notebook above is: getwd() Alternatively you could use the “ls” command on the unix command line. In R you could use the system command: system(""pwd"", intern=TRUE) In Python you can use the “!” command to access the system prompt: !pwd In any case, you just need to find your tenant name for the associated Spark service. Back in RStudio, we can view our tenant name by looking at the spark context: > sc$config$tenant.id[[1]][1] [1] ""s106-a1450be504b787-ea7328759346"" This tenant name should be the exact same as the tenant name in your notebook (above my tenant name is “s106-a1450be504b787-ea7328759346” for this particular Spark service). In order to allow notebook and RStudio elements to interact, we will want to move out of the /notebook/work folder in our notebook, and create new folder (or access an already existing folder) in our Spark service’s root directory. In our notebook, we will move up the to the tenant root directory and create a folder called spark_work1 : The code used in the R Notebook above is: #get the working directory getwd() #set the working directory to the root home for GPFS tenant_name = ""s106-a1450be504b787-ea7328759346"" #FILL IN YOUR TENANT NAME HERE setwd(paste0(""/gpfs/global_fs01/sym_shared/YPProdSpark/user/"", tenant_name)) getwd() #make a directory with systemt command system(""mkdir spark_work1"", intern=TRUE) #move to the directory and verify that it is empty setwd(""./spark_work1"") getwd() system(""ls"", intern=TRUE) Next we will save a sample file from RStudio to this tenant’s GPFS: Here is the R code used above: library(dplyr) #connect to the correct spark instance sc <- spark_connect(config = ""sparklyr"") #get the tenant name from your Notebook tenant_name = ""s106-a1450be504b787-ea7328759346"" #create a temp file in spark iris_tbl = copy_to(sc, iris, ""iris"") #save the temp file to arquet spark_write_parquet( iris_tbl, paste0(""/gpfs/global_fs01/sym_shared/YPProdSpark/user/"", tenant_name, ""/spark_work1/iris_tbl_parquet"")) And now we can take a look at the file, and load it in a notebook: Using the following R code from a notebook on the same spark instance: setwd(""/gpfs/global_fs01/sym_shared/YPProdSpark/user/s106-a1450be504b787-ea7328759346/spark_work1"") system(""ls"", intern= T) dat = read.parquet(sqlContext, ""/gpfs/global_fs01/sym_shared/YPProdSpark/user/s106-a1450be504b787-ea7328759346/spark_work1/iris_tbl_parquet/"" ) take(dat,5) One thing to take note of is that RStudio and R notebooks in DSX interface with Spark through two different APIs (sparklyr and SparkR, respectively). This means that your R code in the two different UIs will not be plug and play, and more importantly, that you cannot save and interact models between the two interfaces. -------------------------------------------------------------------------------- Originally published at datascience.ibm.com on April 25, 2017 by Jim Crozier . * Spark * Rstudio One clap, two clap, three clap, forty?By clapping more or less, you can signal to us which stories really stand out. Blocked Unblock Follow FollowingIBM DATA SCIENCE EXPERIENCE FollowIBM WATSON DATA PLATFORM Build smarter applications and quickly visualize, share, and gain insights * * * * Never miss a story from IBM Watson Data Platform , when you sign up for Medium. 
Learn more Never miss a story from IBM Watson Data Platform Get updates Get updates","It is often useful to use RStudio for one piece of your analysis and notebooks (whether in R, or in another language) for other parts of your analysis. This article will step you through the process…",Working interactively with RStudio and notebooks in DSX,Live,20 63,"Raj Singh Blocked Unblock Follow Following Developer Advocate and Open Data Lead at IBM Watson Data Platform Jun 14 -------------------------------------------------------------------------------- MAPPING FOR DATA SCIENCE WITH PIXIEDUST AND MAPBOX ADD ANOTHER LAYER TO YOUR JUPYTER NOTEBOOKS WITH BUILT-IN MAP RENDERING You’re doing your data a disservice if you don’t use maps. Nearly all data has a spatial component — customer locations, crime, election results, traffic incidents, points-of-purchase, infrastructure locations—and if you’re not familiar with some basic mapping tools, you’re not doing good data science. Seeing your data on a map has many powerful benefits. At the exploratory stage of data science, it’s a great way to get a feel for the geographic distribution of your data. Home sales over a million dollars, winter 2016. Darker means higher price. Data courtesy of Redfin.com .MAKING SCATTERPLOTS LESS SCATTERSHOT A standard scatterplot is a good starting point. It can work OK for spatial data — just use longitude and latitude as your X and Y values — but plotting those points on a real map shows your data’s relationship to real-world features. For example, a scatterplot might reveal that the data is clustered into four major groupings, but a map could show not only those four groupings, but also that they are all near subway stations. Overlaying your data on a map can surface unseen patterns. A plain scatterplot on the home sales data set lacks context without a map.At the presentation stage of data science, using maps is a no-brainer. Combining data with maps is a natural storytelling device (when the story you’re telling has a geographic aspect, of course). So you’ve never done any mapping before, and it seems hard? Not to worry. PixieDust makes it easy , with a little help from Mapbox APIs (and to a lesser extent the Google Maps API), you can get up and running with some beautiful map-based visualizations in no time! GETTING STARTED If you’re unfamiliar with PixieDust, check out this introductory article and get your Jupyter Notebook-based data science environment up and running. PixieDust comes with mapping goodness baked in. As you read through the next section, you can follow along in the Jupyter notebook called pixiedust_mapbox_geocharts hosted on the IBM Data Science Experience . GEOCHARTS A Google GeoChart, which you can easily render in PixieDust given the correct field names.If your data contains a place name column such as country, province or state names, you can make what Google calls a GeoChart — a map that shows those regions color-coded based on the value of a numeric column in your data. To create a GeoChart in PixieDust, you must first have a Spark or Pandas DataFrame with a place name column. Invoke PixieDust on that DataFrame with the display() command: display( mydataframe ) . Click on the chart menu (to the right of the table button) and select the Map item (it’s the one with the globe icon). The options dialog should pop up. If it doesn’t, click on the Options button and drag the field that has place names into Keys . Then for the Values field, choose any numeric field you want to visualize. 
Within the Display Mode menu, choose: * Region to color the entire area of your named places, e.g., countries, provinces, or states. * Markers to place a circle in the center of the region which is scaled according to the data selected for the Value field. * Text to label regions with labels like Russia or Asia . This is good stuff, but if you’re doing heavyweight data science, in most cases your data will be disaggregated down to the point (latitude/longitude) level. This is where we chose to use our mapping partner Mapbox’s API instead of one of Google’s mapping APIs. Selecting the map option from PixieDust’s chart menu, and using the mapbox renderer.POINT MAPPING WITH MAPBOX The Mapbox option lets you create a map of geographic point data. Your DataFrame needs at least the following three fields in order to work with this renderer: * a latitude field named latitude , lat , or y * a longitude field named longitude , lon , long , or x * a numeric field for visualization To use the Mapbox renderer, you need a free API key from Mapbox. You can get one on their website at https://www.mapbox.com/signup/ . When you get your key, enter it in the Options dialog box. In the Options dialog, drag both your latitude and longitude fields into Keys . Then choose any numeric fields for Values . Only the first one you choose is used to color the map thematically, but any other fields specified in Values appear in a tooltip when you hover over a data point on the map. You can also choose the style of the underlying base map, which gives context to your data. The image below uses Mapbox’s “light” style, which works great with a data overlay, as the lightly colored streets and place names don’t fight for attention with your data. Following the instructions above using PixieDust’s test dataset #6, you can reproduce this tasteful map.If you want to see what a place really looks like, zoom in and switch to satellite view: I can’t wait to see what cool things you do with the mapping features in PixieDust. Let me know how you use it here in the comments below. And please ♡ this article to recommend it to other Medium readers. Thanks to Mike Broberg . * Mapbox * Geospatial * Google Maps * Jupyter * Pixiedust Blocked Unblock Follow FollowingRAJ SINGH Developer Advocate and Open Data Lead at IBM Watson Data Platform FollowIBM WATSON DATA LAB The things you can make with data, on the IBM Watson Data Platform. * Share * * * * Never miss a story from IBM Watson Data Lab , when you sign up for Medium. Learn more Never miss a story from IBM Watson Data Lab Get updates Get updates","You’re doing your data a disservice if you don’t use maps. Nearly all data has a spatial component — customer locations, crime, election results, traffic incidents, points-of-purchase, infrastructure…",Mapping for Data Science with PixieDust and Mapbox – IBM Watson Data Lab – Medium,Live,21 64,"IMPORTING JSON DOCUMENTS WITH NOSQLIMPORT Glynn Bird / September 19, 2016Two years ago, I started to write couchimport , a command-line line utility to allow me to import comma-separated and tab-separated files into a Apache CouchDB™ or Cloudant NoSQL database. cat mydata.tsv | couchimport --db mydatabase I built the tool for my own purposes but decided to share it publicly and open-sourced couchimport in case anyone else would find it useful. 
I learned a lot by writing this project: * publishing the project to npm (the Node.js package manager) allows other users to easily install the code — npm install -g couchimport * if your library/utility would save someone an hour of effort, then it's worth open-sourcing it * using the Node.js Stream API allows your application to deal with file streams and HTTP streams interchangeably * a decent README is essential if you expect folks to use your software. In many cases the README.md file is the documentation for your project. The purpose of couchimport is to write data to CouchDB in chunks using data from a text file. It also allows a transform function to be added to the workflow: this function gets called with each new document and can modify it to cast data types, remove unwanted fields or to reorganise the structure of the document. The project turned out to be a very useful way for folks to get started with CouchDB because pre-existing data is likely to be in a spreadsheet or a relational database which can easily be exported to CSV/TSV. I also found I needed to use couchimport's functionality programmatically, and so I exposed some of its functions to the world so that couchimport can be npm install -ed into anyone's Node.js project. In fact, couchimport is the importer used in our Simple Search Service project. INTRODUCING NOSQLIMPORT I recently refactored the couchimport code to make it work with other JSON document stores, so today I'm publishing nosqlimport. This can be installed as a command-line utility: npm install -g nosqlimport Or as a library to be used in your own project: npm install --save nosqlimport On its own, nosqlimport only writes its data to the terminal, but it has three other optional npm modules that can be added for Apache CouchDB, MongoDB and ElasticSearch support: npm install -g nosqlimport-couchdb npm install -g nosqlimport-mongodb npm install -g nosqlimport-elasticsearch The type of database that is written to is defined by the --nosql or -n command line switch at run-time, e.g.: cat movies.tsv | nosqlimport -n couchdb IMPORTING DATA INTO COUCHDB Firstly, define your CouchDB or Cloudant URL as an environment variable: export NOSQL_URL=http://localhost:5984 Or: export NOSQL_URL=https://myusername:mypassword@myaccount.cloudant.com The CouchDB or Cloudant ""database"" to write data to can also be defined as an environment variable: export NOSQL_DATABASE=mydatabase Then import a text file: cat movies.tsv | nosqlimport -n couchdb If you'd prefer to supply all the details as command-line switches, then that's possible too: cat movies.tsv | nosqlimport -n couchdb -u https://myusername:mypassword@myaccount.cloudant.com -db mydatabase IMPORTING DATA INTO MONGODB Firstly, define your MongoDB URL as an environment variable: export NOSQL_URL=mongodb://localhost:27017/mydatabase The MongoDB ""collection"" to write data to can also be defined as an environment variable: export NOSQL_DATABASE=mycollection Then import a text file: cat movies.tsv | nosqlimport -n mongodb If you'd prefer to supply all the details as command-line switches, then that's possible too: cat movies.tsv | nosqlimport -n mongodb -u mongodb://localhost:27017/mydatabase -db mycollection IMPORTING DATA INTO ELASTICSEARCH Firstly, define your ElasticSearch URL as an environment variable: export NOSQL_URL=http://localhost:9200/myindex The ElasticSearch ""type"" to write data to can also be defined as an environment variable: export NOSQL_DATABASE=mytype Then import a text file: cat movies.tsv | nosqlimport -n
elasticsearch If you’d prefer to supply all the details as command-line switches, then that’s possible too: cat movies.tsv | nosqlimport -n elasticsearch -u http://localhost:9200/myindex -db mytype SPECIFYING THE DELIMITER By default, nosqlimport expects text files with a tab character delimiting the columns in the text file, but this can be specified at run time by supplying a --delimiter or -d parameter: cat movies.csv | nosqlimport -d ',' -n couchdb TRANSFORM FUNCTIONS Transform functions are entirely optional but are a very powerful way of modifying the JSON object before it is written to the database. You may need to: * cast data types to force strings to be numbers, or booleans prior to saving * remove some documents that don’t need saving in the first place * rearrange the JSON object e.g. generate a GeoJSON object from a text file of latitudes and longitudes A transform function is saved to a text file before calling nosqlimport and contains a single JavaScript function exported via module.exports . The transform function is called for each row in the incoming text file (except the first line which contains the column headings), and the document it synchronously returns is added to the write buffer. For example, if our source data looked like this: name latitude longitude description live Middlesbrough 54.576841 -1.234976 A large industrial town on the south bank of the River Tees true Boston 42.358056 -71.063611 The largest city in Massachusetts. true Atlantis 0 0 A fictional island falseThe documents being generated and passed to the transform function would look like this: { ""name"": ""Middlesbrough"", ""latitude"": ""54.576841"", ""longitude"": ""-1.234976"", ""description"": ""A large industrial town on the south bank of the River Tees"", ""live"": ""true"" } Notice how: * the object’s keys were inferred from the incoming file’s first line * the values are all strings — because a CSV file doesn’t contain any sense of a column’s data type. In this example, we cast the latitude and longitude values to numbers and force the live value to be a boolean: module.exports = function(doc) { doc.latitude = parseFloat(doc.latitude); doc.longitude = parseFloat(doc.longitude); doc.live = (doc.live === 'true'); return doc; }; To prevent certain documents from being saved, then simply return {} instead of a populated object: module.exports = function(doc) { if (doc.live === 'true') { return doc; } else { // nothing is written to the database return {} } }; Or you can elect to craft a new JSON document in your own format based on the data being imported, in this case GeoJSON: module.exports = function(doc) { if (doc.live === 'true') { var newdoc = { type: 'Feature', geometry: { type: 'Point', coordinates: [ parseFloat(doc.latitude), parseFloat(doc.longitude) ] }, properties: { name: doc.name } }; return newdoc; } else { return {}; } }; A transform function is used by supplying the path to the file containing the code using the -t parameter: cat places.tsv | nosqlimport -n mongodb -t './geojson.js' USING NOSQLIMPORT IN YOUR OWN APPLICATION If you are building a Node.js application and need to be able to import files of content, streams or HTTP streams into a NoSQL database, then you can use nosqlimport in your own project as a dependency. 
Add it to your project with: npm install --save nosqlimport Add the database-specific module: npm install --save nosqlimport-couchdb npm install --save nosqlimport-mongodb npm install --save nosqlimport-elasticsearch And call the code: var nosqlimport = require('nosqlimport'); // connection options var opts = { nosql: 'couchdb', url: 'http://localhost:5984', database: 'mydb'}; // import the data nosqlimport.importFile('./places.tsv', null, opts, function(err, data) { console.log(err, data); }); Or, supply a JavaScript function to transform the data: var nosqlimport = require('nosqlimport'); // cast lat/long to numbers and live to boolean var transformer = function(doc) { doc.latitude = parseFloat(doc.latitude); doc.longitude = parseFloat(doc.longitude); doc.live = (doc.live === 'true'); return doc; }; // connection options var opts = { nosql: 'couchdb', url: 'http://localhost:5984', database: 'mydb', transform: transformer}; // import the data nosqlimport.importFile('./places.tsv', null, opts, function(err, data) { console.log(err, data); }); LINKS nosqlimport and its plugins are open-source projects, so please raise issues or contribute PRs if you can! * https://www.npmjs.com/package/nosqlimport * https://www.npmjs.com/package/nosqlimport-couchdb * https://www.npmjs.com/package/nosqlimport-mongodb * https://www.npmjs.com/package/nosqlimport-elasticsearch","Introducing nosqlimport, an npm module to help you import comma-separated and tab-separated files into your JSON document store of choice.",Move CSVs into different JSON doc stores,Live,22 66,"This video shows you how to build and query a Cloudant Geospatial index using the new Maps in the Cloudant dashboard! Watch the other videos in this series titled ""Introducing Cloudant Geospatial"" and ""Cloudant Geospatial in Action"". Find more videos in the Cloudant Learning Center at http://www.cloudant.com/learning-center.",This video shows you how to build and query a Cloudant Geospatial index using the new Maps in the Cloudant dashboard!,Tutorial: How to build and query a Cloudant geospatial index,Live,23 68,"THE CONVERSATIONAL INTERFACE IS THE NEW PARADIGM Published Jun 30, 2016 In 1962 Thomas Kuhn published The Structure of Scientific Revolutions.
In it he posited that science moves forward with brief, dramatic episodes of revolution in the paradigms of thought followed by longer terms of assimilating and exploring these changes. A stepwise function if you will from revolution to revolution. One could say that the brief history of software is governed by a similar abstraction. From the era of the desktop app to the era of the web page to the era of the mobile app to the latest paradigm shift which seems to be happening now: the conversation. As developers it behooves us to keep up, even if it just appeals to the ""look it's new and shiny"" which some of us have, with these dramatic changes. Certainly, the hype cycle in the short term will get to the point that the conversation bots or assistants or whatever the eventual designated name will be will overrun what is actually possible. Eventually, though this new paradigm like all of those before it will take a long period of time to work its way forward and move into many aspects of computing. What follows is an example which is not even a toy app but we will carry it no further. The goal is to expose you to some of the differences which are currently apparent in this next revolution. It is still early and it is unclear who will win (Siri or Alexa or Facebook Messenger or some unrealesed thing from Google or ...) and what the ultimate ecosystem will look like. It does seem clear though that whoever does win they won't be able to do it all. No one company can write all of the desktop apps or all of the web pages or even all of the mobile apps. Conversational apps will be the same. We will end up with some provider(s) who will deliver the interface to the users either via a message line like Slack or WhatsApp or via voice like Siri and Alexa (both of these ultimately get turned into text lines too). These providers will most likely sit at the center of an ecosystem which will handle NLP (Natural Language Processing), semantic analysis, and other core tasks such as location and calendar integration. So, what will this leave? All of the niche domains provided by all of the many businesses and organizations in the world! It's a huge opportunity. Things like the following: 1. Ask your local grocery store bot if they have an item currently in stock. e.g. @CurbMarket Do you have any local strawberries today? 2. Tell a clothing merchant to notify you next time they have a big sale. e.g. @OakHallClothier Tell me when you have your next sale 3. Use a service to estimate when auto maintenance is due. e.g. @autobot i have a 2011 Toyota Highlander with 48000 miles. Tell me when my next oil change is due. BOTKIT There are many tools for bots today with new ones arriving, some fading and others ""on the horizon"". Currently, there are ""bits and pieces"" for particulars like dialogs (IBM Dialog) and NLP (IBM AlchemyAPI) all the way to large sdk's for voice and digital assistants (Alexa, Siri, and Google). This non comprehensive list points to a few facts about this current space of chatbots. It's early and there is a large scope of investment occurring. While all of these warrant investigating if you are interested in this space, the easiest entry currently is a project called Botkit. It's an open source Javascript library built by the folks at howdy.ai with some assistance from the folks at Slack . It runs as a Node server which can connect via a socket to Slack's Realtime API or it can even handle webhooks from Slack, Facebook, and Twilio. 
Botkit provides a simple framework to handle the basics of creating a chat application. Starting with Slack's Realtime API Slack in some ways is the simplest and arguably most useful of the current platforms. Many teams use Slack with some basic integrations on a daily basis. Many of these bots appear as users inside of Slack and have an online presence in a channel at the same level as a user. It is very easy to connect a bot once you have a token from Slack: var Botkit = require('botkit'); if(!process.env.token) { console.log(""Must set slack token in env.""); process.exit(1); } var controller = Botkit.slackbot({ debug: false }); controller.spawn({ token: process.env.token }).startRTM(function(err) { if(err) { throw new Error(err); } }); The controller above is the core driver that creates the direct connection to Slack via a socket. Then once the bot is connected it can listen for many types of events such as a direct_message or mention or even more basic things like rtm_open and user_channel_join . Often though we just want the bot to hear certain things and react to them: controller.hears(['hello','hi'], ['mention'], function(bot, msg) { bot.reply(msg, ""yello""); }); The above does just that. It registers to hear hello or hi when the bot is mentioned and then it fires the callback, which in this case just replies with a yello . In essence, we just performed the hello world of building and integrating a bot with Slack. A Conversation While hello world is nice, a modestly complex interaction such as a step-by-step conversation really isn't that much more difficult: controller.hears(['what', 'you'], ['mention'], function(bot,msg) { bot.startConversation(msg, function(err, convo) { convo.say('I help you track vehicle maintenance.'); convo.say('You tell me about your vehicle and how much you drive.'); convo.say('then I\'ll keep track of things and notify you when it\'s time for maintenance.' ); convo.ask('Would you like to know more?', [ { pattern: bot.utterances.yes, callback: function(res, convo) { convo.say(""just tell me to 'add' so I can ask you a couple of questions""); convo.next(); } }, { pattern: bot.utterances.no, callback: function(res, convo) { convo.say(""awww""); convo.next(); } }, { default: true, callback: function(res, convo) { convo.repeat(); convo.next(); } } ]); }) }); Once again you register a top-level handler with controller.hears . It listens for what and you with the bot's name mentioned. When that is heard the callback will fire. In this instance it is the bot.startConversation that is most interesting because it starts a stateful flow with that particular user. Typically, this is the kind of construct which can be used to gather information for whatever it is that your app provides to your user. Analogous in some ways to an HTML form, yet this is more like a dynamic workflow. The above example does little more than give some overview as to what this particular bot might actually do. It's like a help message for the user. First, it gives a brief overview with the convo.say calls, then it asks a question. The ask can handle yes and no. If it doesn't get either it does the default and just asks again and again until it does get the yes or no so that it can continue. Truly, not very smart but still a start and a base from which many smarts can be built up. A Multi Step Conversation A FOUNDATION TO BUILD UPON This example of creating a bot which has a presence that can react to textual messages is the foundation of this next revolution.
While the examples above are simplistic they do provide some structure and a view into the basic text lines of voice and chat applications. These are the starting points for much more sophisticated applications. Botkit itself has support for plugging in middleware which can pre- and post-process messages. It would be normal to extend an application with functionality that does deep language analysis or some kind of machine learning in the recognize and trigger portions of the above. Throw in some user context of location and schedules and even some limited knowledge that a digital assistant might have about an individual and the possibilities become plentiful indeed. SOME LINKS 1. Botkit 2. Hubot, an alternative from GitHub 3. Slack API 4. Twilio messaging 5. Facebook Messenger 6. Alexa 7. Code Example on Github",Botkit provides a simple framework to handle the basics of creating a chat application. What follows is an example which is not even a toy app but we will carry it no further. ,The Conversational Interface is the New Paradigm,Live,24 69,"CREATING THE DATA SCIENCE EXPERIENCE IBM Analytics Published on Jun 7, 2016 Want to learn more about how we created the Data Science Experience? We've interviewed hundreds of data scientists and analyzed how they think, how they learn, how they build off the work of others, and how they get feedback to improve their input. Data scientists need a one stop shop environment that enables them to learn, create and collaborate, and that's where the Data Science Experience comes in. We think you're going to love it. Learn more about the Data Science Experience at http://ibm.co/data-science Subscribe to the IBM Analytics Channel: https://www.youtube.com/subscription_... The world is becoming smarter every day, join the conversation on the IBM Big Data & Analytics Hub: http://www.ibmbigdatahub.com https://www.youtube.com/user/ibmbigdata https://www.facebook.com/IBManalytics https://www.twitter.com/IBMbigdata https://www.linkedin.com/company/ibm-...
https://www.slideshare.net/IBMBDA","Want to learn more about how we created the Data Science Experience? We've interviewed hundreds of data scientists and analyzed how they think, how they lear...",Creating the Data Science Experience,Live,25 75,"GOOGLE RESEARCH BLOG The latest news from Research at Google USING MACHINE LEARNING TO PREDICT PARKING DIFFICULTY Friday, February 03, 2017 Posted by James Cook, Yechen Li, Software Engineers and Ravi Kumar, Research Scientist "" When Solomon said there was a time and a place for everything he had not encountered the problem of parking his automobile.
"" - Bob Edwards , Broadcast Journalist Much of driving is spent either stuck in traffic or looking for parking . With products like Google Maps and Waze , it is our long-standing goal to help people navigate the roads easily and efficiently. But until now, there wasn’t a tool to address the all-too-common parking woes. Last week, we launched a new feature for Google Maps for Android across 25 US cities that offers predictions about parking difficulty close to your destination so you can plan accordingly. Providing this feature required addressing some significant challenges: * Parking availability is highly variable, based on factors like the time, day of week, weather, special events, holidays, and so on. Compounding the problem, there is almost no real time information about free parking spots. * Even in areas with internet-connected parking meters providing information on availability, this data doesn’t account for those who park illegally, park with a permit, or depart early from still-paid meters. * Roads form a mostly-planar graph, but parking structures may be more complex, with traffic flows across many levels, possibly with different layouts. * Both the supply and the demand for parking are in constant flux, so even the best system is at risk of being outdated as soon as it’s built. To face these challenges, we used a unique combination of crowdsourcing and machine learning (ML) to build a system that can provide you with parking difficulty information for your destination, and even help you decide what mode of travel to take — in a pre-launch experiment, we saw a significant increase in clicks on the transit travel mode button, indicating that users with additional knowledge of parking difficulty were more likely to consider public transit rather than driving. Three technical pieces were required to build the algorithms behind the parking difficulty feature: good ground truth data from crowdsourcing, an appropriate ML model and a robust set of features to train the model on. Ground Truth Data Gathering high-quality ground truth data is often a key challenge in building any ML solution. We began by asking individuals at a diverse set of locations and times if they found the parking difficult. But we learned that answers to subjective questions like this produces inconsistent results - for a given location and time, one person may answer that it was “ easy ” to find parking while another found it “ difficult. ” Switching to objective questions like “ How long did it it take to find parking? ” led to an increase in answer confidence, enabling us to crowdsource a high-quality set of ground truth data with over 100K responses. Model Features With this data available, we began to determine features we could train a model on. Fortunately, we were able to turn to the wisdom of the crowd , and utilize anonymous aggregated information from users who opt to share their location data, which already is a vital source of information for estimates of live traffic or popular times and visit durations . We quickly discovered that even with this data, some unique challenges remain. For example, our system shouldn’t be fooled into thinking parking is plentiful if someone is parking in a gated or private lot. Users arriving by taxi might look like a sign of abundant parking at the front door, and similarly, public-transit users might seem to park at bus stops. These false positives, and many others, all have the potential to mislead an ML system. So we needed more robust aggregate features. 
Perhaps not surprisingly, the inspiration for one of these features came from our own backyard in downtown Mountain View. If Google navigation observes many users circling downtown Mountain View during lunchtime along trajectories like this one, it strongly suggests that parking might be difficult: Our team thought about how to recognize this “fingerprint” of difficult parking as a feature to train on. In this case, we aggregate the difference between when a user should have arrived at a destination if they simply drove to the front door, versus when they actually arrived, taking into account circling, parking, and walking. If many users show a large gap between these two times, we expect this to be a useful signal that parking is difficult. From there, we continued to develop more features that took into account, for any particular destination, dispersion of parking locations, time-of-day and date dependence of parking (e.g. what if users park close to a destination in the early morning, but further away at busier hours?), historical parking data and more. In the end, we decided on roughly twenty different features along these lines for our model. Then it was time to tune the model performance. Model Selection & Training We decided to use a standard logistic regression ML model for this feature, for a few different reasons. First, the behavior of logistic regression is well understood, and it tends to be resilient to noise in the training data; this is a useful property when the data comes from crowdsourcing a complicated response variable like difficulty of parking. Second, it’s natural to interpret the output of these models as the probability that parking will be difficult, which we can then map into descriptive terms like “ Limited parking ” or “ Easy .” Third, it’s easy to understand the influence of each specific feature, which makes it easier to verify that the model is behaving reasonably. For example, when we started the training process, many of us thought that the “fingerprint” feature described above would be the “silver bullet” that would crack the problem for us. We were surprised to note that this wasn’t the case at all — in fact, it was features based on the dispersion of parking locations that turned out to be one of the most powerful predictors of parking difficulty. Results With our model in hand, we were able to generate an estimate for difficulty of parking at any place and time. The figure below gives a few examples of the output of our system, which is then used to provide parking difficulty estimates for a given destination. Parking on Monday mornings, for instance, is difficult throughout the city, especially in the busiest financial and retail areas. On Saturday night, things are busy again, but now predominantly in the areas with restaurants and attractions. Output of our parking difficulty model in the Financial District and Union Square areas of San Francisco. Red denotes a higher confidence that parking is difficult. Top row: a typical Monday at ~8am (left) and ~9pm (right). Bottom row: the same times but on a typical Saturday. We’re excited about the opportunities to continue to improve the model quality based on user feedback. If we are able to better understand parking difficulty, we will be able to develop new and smarter forms of parking assistance — we’re very excited about future applications of ML to help make transportation more enjoyable! 
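To make the pipeline described above concrete, here is a minimal sketch in Python with scikit-learn. It is emphatically not Google's production code: the feature matrix is random placeholder data, the 'fingerprint' helper and the probability cutoffs are assumptions for illustration, and only the 'Limited parking' and 'Easy' labels come from the post itself.

# Minimal sketch (not Google's pipeline): a logistic-regression parking-difficulty
# classifier over aggregate features, with its output probability mapped to
# descriptive labels. Data, feature names, and thresholds are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def fingerprint_gap(expected_arrival_s, actual_arrival_s):
    # Hypothetical 'fingerprint' feature: extra seconds spent circling,
    # parking, and walking beyond a direct drive to the front door.
    return actual_arrival_s - expected_arrival_s

# X: one row per (destination, time window) with ~20 aggregate features, e.g.
# mean fingerprint gap, dispersion of parking locations, hour of day.
# y: 1 if the crowdsourced ground truth says parking took a long time, else 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))                     # placeholder feature matrix
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=5000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def describe(prob_difficult):
    # 'Limited parking' and 'Easy' appear in the post; the middle bucket and
    # the cutoff values here are invented.
    if prob_difficult > 0.7:
        return 'Limited parking'
    if prob_difficult > 0.4:
        return 'Medium'
    return 'Easy'

probs = model.predict_proba(X_test)[:, 1]
print(describe(probs[0]), round(model.score(X_test, y_test), 3))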
","Much of driving is spent either stuck in traffic or looking for parking. With products like Google Maps and Waze, it is our long-standing goal to help people navigate the roads easily and efficiently. But until now, there wasn’t a tool to address the all-too-common parking woes.",Using Machine Learning to predict parking difficulty,Live,26 77,"GETTING THE BEST PERFORMANCE WITH PYSPARK (Apache Spark channel). Published on Jun 16, 2016.
",This talk assumes you have a basic understanding of Spark and takes us beyond the standard intro to explore what makes PySpark fast and how to best scale our PySpark jobs. If you are using Python and Spark together and want to get faster jobs – this is the talk for you.,Getting The Best Performance With PySpark,Live,27 81,"
","In this paper, we propose gcForest, a decision tree ensemble approach with performance highly competitive to deep neural networks. ",Deep Forest: Towards An Alternative to Deep Neural Networks,Live,28 82,"IBM Data Science Experience, Jan 27
EXPERIENCE IOT WITH COURSERA
I'm very happy and proud to announce that IBM is the first non-academic supplier to offer a data science course on Coursera. We've worked very hard to make this course a great learning experience for anyone interested in data science and IoT, and IBM Data Science Experience is central to the course. It has been great to see all of our hard work pay off. In addition to launching our first Coursera course, Exploring and Visualizing IoT Data, on January 9, 2017, we also kicked off a data science degree program. Since my team and I are working for the IBM Watson IoT division, and IoT is one of the most prominent disruptors in that space, the choice was obvious: create a course on exploring and visualizing IoT data. The course is applicable to any time series problem, including stock exchange data or social media streams, and even non-time-series data. Those interested in learning more about the hardware and cloud data integration part of this topic might want to have a look at the course A developer's guide to the Internet of Things (IoT). I really would have loved to immediately start with artificial intelligence methods for IoT time-series forecasting and anomaly detection, but this would have been the wrong starting point of the journey. To help guide you through that journey, we decided to create a data science degree (in Coursera terms, a specialization), and the courses mentioned above will set the stage and make you familiar with technologies like message brokers, NoSQL databases, Object Storage, Apache SparkSQL, Python and Matplotlib. Using that technology stack, we introduce statistical measures to gain insight on IoT data and learn how to visualize it. Having laid the foundation with the 1st course, we are currently creating a 2nd course on IoT time-series analysis using Apache Spark 2.0 Structured Streaming on the highly optimized Tungsten and Catalyst engines. We will teach you how to detect anomalies and predict future events using advanced statistical methods. Then finally, the last course will talk about artificial intelligence methods using deep learning frameworks — autoencoders and recurrent LSTM networks for anomaly detection and forecasting. So stay tuned! And take the course to start your journey :)
Course links:
https://www.coursera.org/learn/developer-iot
https://www.coursera.org/learn/exploring-visualizing-iot-data
Originally published at datascience.ibm.com on January 27, 2017 by Romeo Kienzler.
* Object Storage * NoSQL * Python * IoT * Education
","I’m very happy and proud to announce that IBM is the first non-academic supplier to offer a data science course on Coursera. We’ve worked very hard to make this course a great learning experience for…",Experience IoT with Coursera,Live,29 84,"HOW OPEN API ECONOMY ACCELERATES THE GROWTH OF BIG DATA AND ANALYTICS
Tags: API, Big Data Analytics, Open Data
An open API is available on the internet for free. We review the growth of the API economy and how organizations have been realizing the potential of open APIs in transforming their business. By Kaushik Pal, TechAlpine.
The already huge world of big data and analytics has got a boost in the form of open Application Programming Interfaces (APIs). The use of open APIs has been generating huge volumes of big data. Since open APIs are now accessed by the general public, mainly via apps and software programs, this has resulted in exponential growth of data. Open APIs are also contributing to the creation of analytics because a group of APIs now have cognitive abilities which enable them to deliver analytics to systems. The growth of open APIs and other APIs has given birth to the term “API economy”. Prominent business houses such as Google and Yahoo have been offering public APIs for different purposes such as weather updates and traffic management.
What is an open API?
An open API is made available on the Internet and is available for use free of cost. For example, a startup software company specialized in the insurance domain may make its underwriting calculation software available as an open API. Interested third-party developers may access the calculation software as per the terms and conditions of the API availability. The third-party developers may use the calculator in any manner unless they are bound by specific terms and conditions of API usage. Usually, open APIs are not bound by any terms and conditions. An open API provides benefits to both its owner and its users. For the owner, whenever the open API is used, its products and services are getting publicity while it retains ownership of the code. For the user, open APIs relieve third-party developers of the effort required to build an entire software program from scratch. The software the third-party developers are building is a mashup between the source software and new code.
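As a concrete illustration of the consumer side, the sketch below calls a hypothetical underwriting open API of the kind just described. The URL, parameters, and response fields are invented; a real provider would document its own, and the requests library is simply a common Python choice for HTTP calls.

# Minimal sketch of a third-party developer consuming a (hypothetical)
# underwriting open API. The endpoint and fields are invented for illustration.
import requests

API_URL = 'https://api.example-insurer.com/v1/underwriting/quote'  # hypothetical

def get_premium_quote(age, coverage_amount):
    # Call the open API and return the quoted annual premium from its JSON reply.
    resp = requests.get(API_URL,
                        params={'age': age, 'coverage': coverage_amount},
                        timeout=10)
    resp.raise_for_status()   # surface HTTP errors instead of failing silently
    return resp.json()['annual_premium']

if __name__ == '__main__':
    print(get_premium_quote(age=35, coverage_amount=250_000))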
Here are some sites that list many public APIs:
* Any-API, Documentation and Test Consoles for Over 225 Public APIs
* ProgrammableWeb API Directory, search over 15,000 APIs
* Wikipedia list of open APIs
* NASA open APIs
* Data.gov APIs
What is an API economy?
So huge and ubiquitous has been the emergence of open APIs that many experts have been using the term API economy to refer to the transactions taking place with the help of APIs. More and more organizations have been realizing the potential of open APIs in transforming their business and have been rolling out open APIs.
Impact of API economy on big data and analytics
So far, the impact of the open API economy on big data and analytics has been felt in the following four areas:
Growth in the data volume
The volume of data has grown even more with the growth of open APIs. Let us understand how open APIs have contributed to the growth of big data with the example of the online education domain. Online education is highly popular now; students use apps and websites to learn. The educational content is stored in different storage systems, and it is a tedious and difficult task to connect so many storage systems to the apps and also maintain them. In such a case, open APIs can really help. Open APIs can help apps and websites interact with different data storage systems. Now, when a student uses an app to access, say, interactive lessons on Java, an open API takes the request to the database, which sends the required data through the API after proper authorization, if applicable. Open APIs make it easy to connect to multiple data sources through apps. To access a data source, all that is needed is to call an API which delivers the requested information. More and more people are using open APIs because of the convenience they provide. Over time, the data volume has grown because more data is being generated: for example, student details, course details, student performance, and analysis and patterns.
Cognitive APIs
Cognitive APIs are a relatively new development in the world of APIs, and they are especially applicable to analytics. A cognitive API accepts a request in a certain format from a system and delivers it to another system. The recipient system provides analytics as a response, which is delivered to the requesting system. Cognitive APIs are capable of processing complex, unstructured data and delivering analytics. Many organizations use such APIs to create their own products and services.
Faster access to big data
APIs can provide big data applications faster access to data storage. This results in faster retrieval, processing and analytics. Such APIs can sit as a layer between distributed computing applications and storage.
APIs now available to the layman
There was a time when APIs were the exclusive territory of developers. Developers still know APIs in and out, but lay users have also been using APIs, albeit indirectly. People have been using apps which connect to the APIs. The APIs take requests and deliver responses from the server, which the user views. This factor has significantly accounted for the huge growth of big data.
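For the provider side, a minimal sketch of what 'making content available as an open API' can look like appears below, using the online-education example from earlier in this section. Flask is only an illustrative Python choice; the route, data, and fields are invented and are not from the article.

# Minimal sketch of exposing stored course content through a small open API so
# apps can fetch it with a single call. All names here are hypothetical.
from flask import Flask, jsonify

app = Flask(__name__)

# Stand-in for the many storage systems an education app might otherwise
# have to integrate with directly.
LESSONS = {
    'java': [{'id': 1, 'title': 'Interactive Java basics'}],
    'python': [{'id': 2, 'title': 'Getting started with Python'}],
}

@app.route('/v1/lessons/<topic>')
def lessons(topic):
    # Return lesson metadata for a topic as JSON.
    return jsonify(LESSONS.get(topic.lower(), []))

if __name__ == '__main__':
    app.run(port=8080)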
Important Statistics
The statistics below establish that the API economy has been getting stronger and influencing big data and analytics.
* There is a difference between the growth prediction and the actual growth of public APIs, as per the ProgrammableWeb directory of APIs. This is shown by the image below. Source: nordicapis.com/tracking-the-growth-of-the-api-economy/
* However, the above estimate may be deceptive because there are other API directories too. Also, the impact of APIs is best gauged when they are consumed by third-party APIs. Such instances are not recorded often, but that does not diminish the importance of the APIs.
* The image below shows that the number of API calls has increased significantly over the years. Source: blog.mailchimp.com/10m-api-calls-per-day-more/
* As per Netflix, the number of requests the Netflix API has received over the years has increased exponentially. From less than 1 billion requests in January 2010, the number of requests grew to more than 20 billion requests in January 2011. Source: http://techblog.netflix.com/2011/02/redesigning-netflix-api.html
* A significant development has been APIs becoming more inclusive. There was a time when only developers could understand APIs. Now, APIs can be used even by non-development people. Laymen are able to access APIs, albeit without their knowledge, through apps.
* Smartphones use countless mobile services that are built on APIs.
Conclusion
It seems that open APIs are synonymous with convenience, time savings and efficiency. There are good reasons that businesses consider open APIs an important business development tool. With due importance given to the other influences open APIs have had on big data and analytics, the involvement of the general public seems to be the most important driver of big data and analytics growth. Considering the present times, the API economy seems to be on course for explosive growth over the next few years, and it will redefine many businesses.
Related:
* Data Science and Cognitive Computing with HPE Haven OnDemand: The Simple Path to Reason and Insight
* HPE Haven OnDemand and Microsoft Azure Machine Learning: Power Tools for Developers and Data Scientists
* Machine Learning at your fingertips – 60+ free APIs, from HPE Haven OnDemand
",An open API is available on the internet for free. We review the growth of API economy and how organizations have been realizing the potential of open APIs in transforming their business. ,How open API economy accelerates the growth of big data and analytics,Live,30 87,"DATA SCIENCE EXPERIENCE: SIGN UP FOR FREE TRIAL (developerWorks TV). Published on Oct 3, 2017. Find more videos in the Data Science Experience Learning Center at http://ibm.biz/dsx-learning
",This video shows you how to sign up for a free trial of IBM Data Science Experience (DSX).,Sign up for a free trial in DSX,Live,31 90,"A KAGGLER'S GUIDE TO MODEL STACKING IN PRACTICE
Ben Gorman | 12.27.2016
INTRODUCTION
Stacking (also called meta ensembling) is a model ensembling technique used to combine information from multiple predictive models to generate a new model. Often times the stacked model (also called the 2nd-level model) will outperform each of the individual models due to its smoothing nature and its ability to highlight each base model where it performs best and discredit each base model where it performs poorly. For this reason, stacking is most effective when the base models are significantly different. Here I provide a simple example and guide on how stacking is most often implemented in practice. Feel free to follow this article using the related code and datasets here in the Machine Learning Problem Bible. This tutorial was originally posted here on Ben's blog, GormAnalysis.
MOTIVATION
Suppose four people throw a combined 187 darts at a board. For 150 of those we get to see who threw each dart and where it landed. For the rest, we only get to see where the dart landed. Our task is to guess who threw each of the unlabelled darts based on their landing spot.
K-NEAREST NEIGHBORS (BASE MODEL 1)
Let's make a sad attempt at solving this classification problem using a K-Nearest Neighbors model. In order to select the best value for K, we'll use 5-fold Cross-Validation combined with Grid Search where K = (1, 2, …, 30). In pseudo code:
1. Partition the training data into five equal-size folds. Call these test folds.
2. For K = 1, 2, …, 30
3. For each test fold
   1. Combine the other four folds to be used as a training fold
   2. Fit a K-Nearest Neighbors model on the training fold (using the current value of K)
   3. Make predictions on the test fold and measure the resulting accuracy rate of the predictions
   Calculate the average accuracy rate from the five test fold predictions
4. Keep the K value with the best average CV accuracy rate
With our fictitious data we find K=1 to have the best CV performance (67% accuracy). Using K=1, we now train a model on the entire training dataset and make predictions on the test dataset. Ultimately this will give us about 70% classification accuracy.
SUPPORT VECTOR MACHINE (BASE MODEL 2)
Now let's make another sad attempt at solving the problem using a Support Vector Machine. Additionally, we'll add a feature DistFromCenter that measures the distance each point lies from the center of the board, to help make the data linearly separable. With R's LiblineaR package we get two hyper parameters to tune:
TYPE
1. L2-regularized L2-loss support vector classification (dual)
2. L2-regularized L2-loss support vector classification (primal)
3. L2-regularized L1-loss support vector classification (dual)
4. support vector classification by Crammer and Singer
5. L1-regularized L2-loss support vector classification
COST
Inverse of the regularization constant
The grid of parameter combinations we'll test is the cartesian product of the 5 listed SVM types with cost values of (.01, .1, 1, 10, 100, 1000, 2000). That is:
type  cost
1     0.01
1     0.1
1     1
…     …
5     100
5     1000
5     2000
Using the same CV + Grid Search approach we used for our K-Nearest Neighbors model, here we find the best hyper-parameters to be type = 4 with cost = 1000. Again, we use these parameters to train a model on the full training dataset and make predictions on the test dataset. This'll give us about 61% CV classification accuracy and 78% classification accuracy on the test dataset.
STACKING (META ENSEMBLING)
Let's take a look at the regions of the board each model would classify as Bob, Sue, Mark, or Kate. Unsurprisingly, the SVM does a good job at classifying Bob's throws and Sue's throws but does poorly at separating Kate's throws and Mark's throws. The opposite appears to be true for the K-Nearest Neighbors model. HINT: Stacking these models will probably be fruitful. There are a few schools of thought on how to actually implement stacking. Here's my personal favorite applied to our example problem:
1. Partition the training data into five test folds
train
ID   FoldID  XCoord  YCoord  DistFromCenter  Competitor
1    5       0.7     0.05    0.71            Sue
2    2       -0.4    -0.64   0.76            Bob
3    4       -0.14   0.82    0.83            Sue
…    …       …       …       …               …
183  2       -0.21   -0.61   0.64            Kate
186  1       -0.86   -0.17   0.87            Kate
187  2       -0.73   0.08    0.73            Sue
2. Create a dataset called train_meta with the same row Ids and fold Ids as the training dataset, with empty columns M1 and M2.
Similarly create a dataset called test_meta with the same row Ids as the test dataset and empty columns M1 and M2.
train_meta
ID   FoldID  XCoord  YCoord  DistFromCenter  M1  M2  Competitor
1    5       0.7     0.05    0.71            NA  NA  Sue
2    2       -0.4    -0.64   0.76            NA  NA  Bob
3    4       -0.14   0.82    0.83            NA  NA  Sue
…    …       …       …       …               …   …   …
183  2       -0.21   -0.61   0.64            NA  NA  Kate
186  1       -0.86   -0.17   0.87            NA  NA  Kate
187  2       -0.73   0.08    0.73            NA  NA  Sue
test_meta
ID   XCoord  YCoord  DistFromCenter  M1  M2  Competitor
6    0.06    0.36    0.36            NA  NA  Mark
12   -0.77   -0.26   0.81            NA  NA  Sue
22   0.18    -0.54   0.57            NA  NA  Mark
…    …       …       …               …   …   …
178  0.01    0.83    0.83            NA  NA  Sue
184  0.58    0.2     0.62            NA  NA  Sue
185  0.11    -0.45   0.46            NA  NA  Mark
3. For each test fold {Fold1, Fold2, … Fold5}
3.1 Combine the other four folds to be used as a training fold
train fold1
ID   FoldID  XCoord  YCoord  DistFromCenter  Competitor
1    5       0.7     0.05    0.71            Sue
2    2       -0.4    -0.64   0.76            Bob
3    4       -0.14   0.82    0.83            Sue
…    …       …       …       …               …
181  5       -0.33   -0.57   0.66            Kate
183  2       -0.21   -0.61   0.64            Kate
187  2       -0.73   0.08    0.73            Sue
3.2 For each base model
M1: K-Nearest Neighbors (k = 1)
M2: Support Vector Machine (type = 4, cost = 1000)
3.2.1 Fit the base model to the training fold and make predictions on the test fold. Store these predictions in train_meta to be used as features for the stacking model
train_meta with M1 and M2 filled in for fold1
ID   FoldID  XCoord  YCoord  DistFromCenter  M1   M2   Competitor
1    5       0.7     0.05    0.71            NA   NA   Sue
2    2       -0.4    -0.64   0.76            NA   NA   Bob
3    4       -0.14   0.82    0.83            NA   NA   Sue
…    …       …       …       …               …    …    …
183  2       -0.21   -0.61   0.64            NA   NA   Kate
186  1       -0.86   -0.17   0.87            Bob  Bob  Kate
187  2       -0.73   0.08    0.73            NA   NA   Sue
4. Fit each base model to the full training dataset and make predictions on the test dataset. Store these predictions inside test_meta
test_meta
ID   XCoord  YCoord  DistFromCenter  M1    M2    Competitor
6    0.06    0.36    0.36            Mark  Mark  Mark
12   -0.77   -0.26   0.81            Kate  Sue   Sue
22   0.18    -0.54   0.57            Mark  Sue   Mark
…    …       …       …               …     …     …
178  0.01    0.83    0.83            Sue   Sue   Sue
184  0.58    0.2     0.62            Sue   Mark  Sue
185  0.11    -0.45   0.46            Mark  Mark  Mark
5. Fit a new model, S (i.e. the stacking model), to train_meta, using M1 and M2 as features. Optionally, include other features from the original training dataset or engineered features.
S: Logistic Regression (from the LiblineaR package, type = 6, cost = 100). Fit to train_meta
6. Use the stacked model S to make final predictions on test_meta
test_meta with stacked model predictions
ID   XCoord  YCoord  DistFromCenter  M1    M2    Pred  Competitor
6    0.06    0.36    0.36            Mark  Mark  Mark  Mark
12   -0.77   -0.26   0.81            Kate  Sue   Sue   Sue
22   0.18    -0.54   0.57            Mark  Sue   Mark  Mark
…    …       …       …               …     …     …     …
178  0.01    0.83    0.83            Sue   Sue   Sue   Sue
184  0.58    0.2     0.62            Sue   Mark  Sue   Sue
185  0.11    -0.45   0.46            Mark  Mark  Mark  Mark
The main point to take home is that we're using the predictions of the base models as features (i.e. meta features) for the stacked model. So, the stacked model is able to discern where each model performs well and where each model performs poorly. It's also important to note that the meta features in row i of train_meta are not dependent on the target value in row i because they were produced using information that excluded target_i in the base models' fitting procedure.
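Rendered as code, the procedure above looks roughly like the scikit-learn sketch below. This is a sketch under stated assumptions, not the article's original R workflow: KNeighborsClassifier and LinearSVC stand in for the R models used in the post, the data is random placeholder data, and cross_val_predict produces the out-of-fold meta features in one call.

# Minimal scikit-learn sketch of steps 1-6 above; data and labels are placeholders.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

rng = np.random.default_rng(0)
X_train = rng.normal(size=(150, 3))            # XCoord, YCoord, DistFromCenter
y_train = rng.integers(0, 4, size=150)         # Bob, Sue, Mark, Kate coded as 0..3
X_test = rng.normal(size=(37, 3))

base_models = {
    'M1': KNeighborsClassifier(n_neighbors=1),
    'M2': LinearSVC(C=1000, dual=False, max_iter=10000),
}
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Steps 2-3: out-of-fold predictions become the meta features in train_meta.
train_meta = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=folds) for m in base_models.values()
])
# Step 4: refit each base model on all training data to fill test_meta.
test_meta = np.column_stack([
    m.fit(X_train, y_train).predict(X_test) for m in base_models.values()
])

# Steps 5-6: the stacker uses the meta features plus DistFromCenter.
stack_train = np.column_stack([train_meta, X_train[:, 2]])
stack_test = np.column_stack([test_meta, X_test[:, 2]])
stacker = LogisticRegression(max_iter=1000).fit(stack_train, y_train)
print(stacker.predict(stack_test)[:10])

Note that the base models are refit on the full training set only for the test-set meta features, which matches the first of the two approaches discussed here.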
Alternatively, we could make predictions on the test dataset using each base model immediately after it gets fit to each test fold. In our case this would generate test-set predictions for five K-Nearest Neighbors models and five SVM models. Then we would average the predictions per model to generate our M1 and M2 meta features. One benefit to this is that it's less time consuming than the first approach (since we don't have to retrain each model on the full training dataset). It also helps that our train meta features and test meta features should follow a similar distribution. However, the test metas M1 and M2 are likely more accurate in the first approach, since each base model was trained on the full training dataset (as opposed to 80% of the training dataset, five times, in the 2nd approach).
STACKED MODEL HYPER PARAMETER TUNING
So, how do you tune the hyper parameters of the stacked model? Regarding the base models, we can tune their hyper parameters using Cross-Validation + Grid Search just like we did earlier. It doesn't really matter what folds we use, but it's usually convenient to use the same folds that we use for stacking. Tuning the hyper parameters of the stacked model is where things get interesting. In practice most people (including myself) simply use Cross-Validation + Grid Search using the same exact CV folds used to generate the meta features. There's a subtle flaw to this approach – can you spot it? Indeed, there's a small bit of data leakage in our stacking CV procedure. Consider the 1st round of Cross-Validation for the stacked model. We fit a model S to {fold2, fold3, fold4, fold5}, make predictions on fold1 and evaluate performance. But the meta features in {fold2, fold3, fold4, fold5} are dependent on the target values in fold1. So, the target values we're trying to predict are themselves embedded into the features we're using to fit our model. This is leakage, and in theory S could deduce information about the target values from the meta features in a way that would cause it to overfit the training data and not generalize well to out-of-bag samples. However, you have to work hard to conjure up an example where this leakage is significant enough to cause the stacked model to overfit. In practice, everyone ignores this theoretical hole (and frankly I think most people are unaware it even exists!).
STACKING MODEL SELECTION AND FEATURES
How do you know what model to choose as the stacker and what features to include with the meta features? In my opinion, this is more of an art than a science. Your best bet is to try different things and familiarize yourself with what works and what doesn't. Another question is, what (if any) other features should you include for the stacking model in addition to the meta features? Again this is somewhat of an art. Looking at our example, it's pretty evident that DistFromCenter plays a part in determining which model will perform well. The KNN appears to do better at classifying darts thrown near the center, and the SVM model does better at classifying darts thrown away from the center. Let's take a shot at stacking our models using Logistic Regression. We'll use the base model predictions as meta features and DistFromCenter as an additional feature. Sure enough the stacked model performs better than both of the base models – 75% CV accuracy and 86% test accuracy. Now let's take a look at its classification regions overlaying the training data, just like we did with the base models. The takeaway here is that the Logistic Regression Stacked Model captures the best aspects of each base model, which is why it performs better than either base model in isolation.
STACKING IN PRACTICE
To wrap this up, let's talk about how, when, and why you might use stacking in the real world. Personally, I mostly use stacking in machine learning competitions on Kaggle.
In general, stacking produces small gains with a lot of added complexity – not worth it for most businesses. But stacking is almost always fruitful, so it's almost always used in top Kaggle solutions. In fact, stacking is really effective on Kaggle when you have a team of people trying to collaborate on a model. A single set of folds is agreed upon and then every team member builds their own model(s) using those folds. Then each model can be combined using a single stacking script. This is great because it prevents team members from stepping on each other's toes, awkwardly trying to stitch their ideas into the same code base. One last bit. Suppose we have a dataset with (user, product) pairs and we want to predict the probability that a user will purchase a given product if he/she is presented an ad with that product. An effective feature might be something like: using the training data, what percent of the products advertised to a user did they actually purchase in the past? So, for the sample (user1, productA) in the training data, we want to tack on a feature like UserPurchasePercentage, but we have to be careful not to introduce leakage into the data. We do this as follows:
1. Split the training data into folds
2. For each test fold
   1. Identify the unique set of users in the test fold
   2. Use the remaining folds to calculate UserPurchasePercentage (percent of advertised products each user purchased)
   3. Map UserPurchasePercentage back to the training data via (fold id, user id)
Now we can use UserPurchasePercentage as a feature for our gradient boosting model (or whatever model we want). Effectively what we've just done is built a predictive model that predicts user_i will purchase product_x with probability based on the percent of advertised products they purchased in the past, and used those predictions as a meta feature for our real model. This is a subtle but valid and effective form of stacking – one which I often do implement in practice and on Kaggle.
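A minimal pandas sketch of that fold-based feature appears below. The column names, toy rows, and round-robin fold assignment are invented for illustration; the only point is that each row's UserPurchasePercentage is computed from the other folds, never its own.

# Leakage-free UserPurchasePercentage: compute each user's purchase rate from
# the other folds and map it back onto the held-out fold's rows.
import pandas as pd
import numpy as np

train = pd.DataFrame({
    'user_id':    [1, 1, 1, 2, 2, 3, 3, 3],
    'product_id': [10, 11, 12, 10, 13, 11, 12, 14],
    'purchased':  [1, 0, 1, 0, 1, 0, 0, 1],
})
k = 4
train['fold_id'] = np.arange(len(train)) % k     # toy round-robin fold assignment

train['user_purchase_pct'] = np.nan
for fold in range(k):
    in_fold = train['fold_id'] == fold
    # Purchase rate per user, using only the *other* folds...
    rates = train.loc[~in_fold].groupby('user_id')['purchased'].mean()
    # ...then mapped back onto the held-out fold (NaN if the user is unseen there).
    train.loc[in_fold, 'user_purchase_pct'] = (
        train.loc[in_fold, 'user_id'].map(rates).values
    )

print(train)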
BIO
I'm Ben Gorman – math nerd and data science enthusiast based in the New Orleans area. I spent roughly five years as the Senior Data Analyst for Strategic Comp before starting GormAnalysis. I love talking about data science, so never hesitate to shoot me an email if you have questions: bgorman@gormanalysis.com. As of September 2016, I'm a Kaggle Master ranked in the top 1% of competitors world-wide.
",Stacking is a model ensembling technique used to combine information from multiple predictive models to generate a new model. Often times the stacked model will outperform each of the individual mo…,A Kaggler's Guide to Model Stacking in Practice,Live,32 91,"Working Vis","Analytics and visualization often go hand-in-hand. One of the great things about notebooks such as IPython/Jupyter is that they provide a single interface to numerous data analysis technologies that often can be used together. So, using Brunel within notebooks is a very natural fit. For example, I can use a wide variety of python libraries to cleanse, shape and analyze data–and then use Brunel to visualize those results.",Using Brunel in IPython/Jupyter Notebooks,Live,33 96,"Steve Moore (parent, playwright, artistic director of Physical Plant Theater, and IBMer), Jul 10
NEW MENTAL MODELS FOR MACHINE LEARNING: PART 1
Machine learning has already extended into so many aspects of daily life that it can be handy for us to simply memorize a set of go-to examples of its impact on certain industries. For instance, we might think of fraud detection as the canonical example of machine learning in the financial sector. Or we might think of Watson's cognitive approach to oncology as the canonical example of machine learning in healthcare. Or, yet again, we might point to recommendation engines at Netflix and Amazon as canonical examples of machine learning in retail. Certainly, those are tremendous demonstrations of the power of the technology — and in aggregate, they give a sense of machine learning's pervasive presence in our lives. But the convenience of go-to examples might come at a cost. In particular, citing the same handy examples might keep us from noticing the wide diversity of machine learning use cases within individual sectors.
This post is the first in a series aimed at shaking up our intuitions about the things that machine learning is making possible in specific sectors — to look beyond the same set of use cases that always come to mind. Let's start with Government…
1. BRINGING ML TO ENVIRONMENTAL PROTECTION
As much as any commercial sector, Government is under constant pressure to do more with less, to serve more constituents more effectively and more intelligently. That includes agencies tasked with environmental protection like the DCMR Milieudienst Rijnmond, which battles pollution, waste, and other environmental threats for the region surrounding Rotterdam in the Netherlands. By combining various IBM Analytics software, a strong partnership with the Dutch security firm DataExpert, and a suite of remote sensors, the team can use machine learning to help identify and evaluate environmental hazards in real time — and sort the hazards by severity and urgency. By detecting and assessing environmental threats algorithmically, the system can identify key risks and lack of compliance. Automating and improving that aspect of their work can give the DCMR more time and energy for other action that could boost public safety.
2. ML AND JOB SECURITY FOR BELGIANS
In the same corner of Europe, an employment and vocational agency called VDAB is striving to give workers in Belgium's Flanders region the information and resources they need to find and keep work. Thankfully, unemployment in Belgium is falling — from 8.2% to 6.8% in the last year — but even at 6.8%, there's clearly more work to do. One of the agency's key goals is reducing the duration of unemployment for young workers while finding ways to direct limited resources where they're truly needed. The machine learning solution: an ML model crafted by IBM Global Business Services that crunches past data to predict the duration of unemployment for each job seeker. By focusing attention on the young Belgians most at risk, the agency can do more to interrupt the patterns of joblessness and kick off self-reinforcing steps toward job security — a long-term boon to the economy at large.
3. ML IN THE FIGHT TO FEED THE YOUNG
Halfway around the world, we find the Instituto Colombiano de Bienestar Familiar, a children and family welfare organization working nationwide in Colombia for the prevention and protection of early childhood, childhood, adolescence and the welfare of families. On a tight budget, the organization still manages to reach more than 8 million Colombians with its programs and services. Among those 8 million, 38,730 in 2016 were malnourished children who received 29,552 emergency food rations and more than five million dietary supplements. That work didn't happen by accident. Behind the scenes, the analytics firm Infórmese used IBM SPSS Modeler to provide predictive analytics and micro-targeting capabilities that optimize the distribution of aid to Colombia's poorest and most remote areas.
GOOD GOVERNANCE
Governments and their agencies across the world are using machine learning at the national and local level to do more than process tax returns or make the buses run on time. Let's put these three new examples in our tool belts as we continue to advocate for machine learning — and as we look for new ways to bring its capabilities to bear.
* Machine Learning
","Machine learning has already extended into so many aspects of daily life that it can be handy for us to simply memorize a set of go-to examples of its impact on certain industries. For instance, we…",Top 10 Machine Learning Use Cases: Part 1,Live,34 99,"Nick Kasten (Computer Science / Math Student @ Texas State University), Aug 30
GAZE INTO MY REDDIT CRYSTAL BALL: USING WATSON MACHINE LEARNING TO PREDICT A POST'S POTENTIAL
Editor's note: This article is part of an occasional series by the 2017 summer interns on the Watson Data Platform developer advocacy team, depicting projects they developed using Bluemix data services, Watson APIs, the IBM Data Science Experience, and more.
Reddit is a social news-aggregation and discussion forum that receives millions of new posts every day. Some of these posts are links or images, but some contain only text, and usually serve to request/provide information or spark some kind of discussion. Users on the site can “upvote” or “downvote” these posts, nudging the post's score by one in either a positive or negative direction. The end result of this system is a ranked list of posts for users to scroll through, divided into “subreddits” (subjects), with the posts having the highest scores situated at the top.
A look at the Reddit interface from the MachineLearning subreddit.
What if there were a way, using Watson Machine Learning and Watson Cognitive Services, to predict the score of a post before putting it on Reddit? Spoiler alert: there is! In this article, I'll describe an app I built to help with my Reddit game, and what I learned about machine learning in the process. I'll also share the code so you can try it yourself.
INTRODUCING THE REDDIT CRYSTAL BALL
The Reddit Crystal Ball is an app that predicts how high a score your post will receive on Reddit when posted at the current time. If there's a time later in the day at which the app thinks you could get a better score, you'll be notified of that as well.
This post would likely earn a higher score later in the day.
The app uses Watson's machine learning service to make its prediction, which is based on a few different factors:
* Subreddit
* Current Time of Day
* Average Word Size
* Watson Social Tone Analysis
I used these features to build a machine learning model with Spark ML, which I then deployed on Bluemix using the Watson ML service. This creates a “scoring endpoint,” which allows us to interact with and query our model through a REST API that can be accessed from any platform, using any programming language.
EVALUATING ALGORITHMS
To make predictions, the machine learning model uses an algorithm called K-Means Clustering to group similar posts into clusters. The clusters of posts are then analyzed to determine the average score for posts placed in each cluster, and the clusters are then separated into 4 groups: Low, Medium, High, and Great.
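Since the post names Spark ML and K-Means clustering but does not reproduce the notebook code here, the following is only a minimal PySpark sketch of that clustering step; the column names, toy rows, and k value are invented.

# Minimal Spark ML sketch of the clustering step described above (not the
# author's actual notebook code; all column names and values are placeholders).
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName('reddit-crystal-ball').getOrCreate()

# posts: one row per Reddit post with numeric features and the observed score.
posts = spark.createDataFrame(
    [(0.0, 13.5, 4.2, 0.7, 0.1, 152),
     (1.0, 2.0, 5.1, 0.2, 0.6, 3),
     (0.0, 20.0, 3.8, 0.5, 0.3, 48)],
    ['subreddit_idx', 'hour_of_day', 'avg_word_size', 'tone_joy', 'tone_anger', 'score'])

assembler = VectorAssembler(
    inputCols=['subreddit_idx', 'hour_of_day', 'avg_word_size', 'tone_joy', 'tone_anger'],
    outputCol='features')
model = KMeans(k=3, seed=1, featuresCol='features').fit(assembler.transform(posts))

# Average observed score per cluster; these averages would then be bucketed
# into Low / Medium / High / Great.
clustered = model.transform(assembler.transform(posts))
clustered.groupBy('prediction').agg(F.avg('score').alias('avg_score')).show()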
I initially attempted using decision-tree- and probabilistic-based algorithms like Random Forest and Naive Bayes to predict a specific score, but quickly learned that predicting an exact score was not going to work well given the constraints of this data set. Because I wanted to document the process of gathering data, processing it, and creating a machine learning model, I chose to build this project in a Jupyter Notebook . A notebook is an environment that allows documentation and executable code to live together, side-by-side, so it was perfect for this project. In a Jupyter Notebook, documentation and code live side-by-side inside cells.IMPLEMENTATION IN A DATA SCIENCE NOTEBOOK After stepping through my notebook , you’ll not only understand how the data was processed and used to train a model, but you’ll also be able to interact with that model and use it to make predictions on you own posts. These interactive elements are called PixieApps . A PixieApp is an app created with Python, using the PixieDust helper library , that runs in the notebook itself. Using the templating language Jinja 2 , it becomes relatively easy to create a nice UI that helps the data come alive. WHAT I [MACHINE] LEARNED After playing with the data and interacting with the model in the PixieApp, some interesting trends emerged. While all the features influenced the prediction, the most important were the choice of subreddit and the time a post was made. This makes sense, since different sections of the site are likely to be most active at different times, and it follows that posts would score higher during these periods of activity. At the same time, a post containing a link — which can drastically increase the average word size of a post — or posts that skew far in a certain direction in the tone analysis can be predicted to score higher or lower than solely based on subreddit and time alone. At the start of this project, machine learning was a completely foreign concept to me. Even the process of gathering, cleaning, and analyzing data was something I had little experience with. The great thing about notebooks on the IBM Data Science Experience is that you get the Pandas Python Data Analysis Library and the Spark engine out-of-the-box to get you started with small and large data science projects alike. Working with these tools, I was able to analyze Reddit post data set and experiment with different features in-depth. Now, I feel I have a much better grasp on what machine learning does, how it works, and the tools needed to work with large data sets. So check out the notebook and PixieApp I created, and let me know what you think here in the comments. You’ll be able to see the entire process of building and deploying the model, and you’ll have the opportunity to make predictions on your own posts. You might even find that it helps you create the perfect, high-scoring Reddit post of your dreams. To create your own crystal ball, load the notebook , complete the setup steps, and follow the instructions in the notebook cells. May your comments be plentiful, and your future filled with upvotes! Thanks to Patrick Titzler , Teri Chadbourne, CMP , and Mark Watson . * Machine Learning * Reddit * Ibm Watson * Jupyter Notebook * Cognitive Computing Blocked Unblock Follow FollowingNICK KASTEN Computer Science / Math Student @ Texas State University FollowIBM WATSON DATA LAB The things you can make with data, on the IBM Watson Data Platform. 
","In this article, I'll describe an app I built to help with my Reddit game, and what I learned about machine learning in the process. I'll also share the code so you can try it yourself.",Gaze Into My Reddit Crystal Ball – IBM Watson Data Lab – Medium,Live,35 100,"DATA VISUALIZATION PLAYBOOK: DETERMINING THE RIGHT LEVEL OF DETAIL August 31, 2015 by Jennifer Shin Topics: Analytics, Big Data Research, Big Data Technology, Big Data Use Cases, Data Scientists Tags: data visualization, analytics, data science One of the most important steps for creating data visualizations is selecting which aspects, features or dimensions of the data to present—in other words, letting the data dictate the visualization. Unlike school assignments, data scientists and professionals rarely receive a project that provides the same clear guidance they received as children. There is no longer a teacher who assigns a bar chart; instead, data scientists are expected to find insights that will enlighten managers and colleagues. Utilizing data science can be beneficial to anyone interested in an effective visualization. This article shows how data science can be used to create effective data visualizations by focusing on one key question every data scientist needs to ask: What level of detail should I show in my visualization? To demonstrate the importance of this question, consider the following scenario. A researcher is conducting an experiment and records the date, time and a measurement at 6 a.m., 2 p.m. and 8 p.m. every day for a month. How can this data set be visualized? STEP 1: CREATING VISUAL REPRESENTATIONS The most direct way to present the data is to plot each data point. In Figure 1, each measurement recorded over the course of the study is plotted against the date and time using a bar chart. Figure 1: Bar chart of each measurement recorded over the course of the study. Bar charts can seem simple and easy to use, but selecting the wrong data can impact the effectiveness of any visualization. With close to 100 data points in Figure 1, including every data point makes it difficult to gain significant insight without further analysis. If plotting each data point doesn't provide meaningful insight, consider using summary statistics to gather information and as a starting point for finding useful patterns in the data set. In certain cases, visualizing summary statistics may be sufficient for presenting information. For example, a chart showing the average temperature for each month can be an effective presentation of the seasonal weather changes for a geographic region. STEP 2: DIGGING INTO THE DATA In the previous step, Figure 1 fell short of presenting usable insights. To get better insights, you can use summary statistics to analyze the data points directly or evaluate the visualization.
Either approach allows data scientists to explore potential patterns in the data set, as shown below. For data scientists who prefer to work directly with the data set, daily or weekly averages can present an effective overview by splitting up the data set into different levels. Figure 2a shows the daily average for the first seven days and the difference between the daily averages and the weekly average. The table shows that the difference between the daily average and the weekly average stands out on the sixth day, when the daily average is significantly higher than the weekly average. With the discovery of unusual behavior in the first week, it's easy to check whether the pattern is consistent during the other weeks of the study. Figure 2a: Measurements for the first seven days, including the daily average and the difference between the daily average and the weekly average. For data scientists who prefer to work with visualization, the bar chart in Figure 1 can serve as a valuable source for insights. Figure 2b shows the measurements for the first seven days of the study with the average for this period represented by the horizontal red line. Similar to the previous step, the values for the sixth day are significantly different from the values for the other days in the study. Figure 2b: Measurements recorded during the first 7 days. The red line represents the average measurement recorded. STEP 3: REVISING THE CHART Since both the data set and the original visualization revealed that the data peaked on the sixth day, Figure 1 can be revised to determine if this pattern is consistent throughout the study. Specifically, in Figure 3, the three measurements recorded each day are represented as one averaged daily value, which shows that the measurement values peak each week on the same weekday. Figure 3: Bar chart of the average measurement recorded daily. APPLY WITH CAUTION While averages can be useful for data mining, using this approach too liberally can inadvertently result in hiding valuable information. By replacing daily averages with weekly averages, Figure 4a no longer shows the peaks that occur on days 6, 13, 20 and 27—and the measurements are so close that the chart suggests there is very little variability in the data. Figure 4a: Bar chart of the average measurement recorded weekly. Conceptually, calculating the average of a set of numbers is similar to redistributing the amounts evenly across these values until each one is equal. For instance, finding the average of 8 and 12 can be thought of as taking 2 from 12 and adding it to 8 so that the two values both equal 10, which is the average of the two numbers. Hence, if a set of numbers includes extreme values, averaging these terms can result in the loss of vital information. Remember that using a “one-size-fits-all” approach can increase the chances of hiding or missing important insights. Creating alternative visualizations of the measurements by time, as in Figure 4b, will minimize this risk and open up the possibility of finding new patterns. Figure 4b: Line chart of the measurements by time. Discover how the IBM advanced analytics portfolio can help you find patterns and derive insights by visually exploring data.
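Before leaving the topic of averages, here is a small pandas sketch that makes the trade-off concrete. It is an illustration only, using made-up measurements rather than the study's actual data, but it mimics three readings per day with a spike every sixth day and shows how the weekly mean hides what the daily mean still reveals:

# Illustrative only: synthetic measurements at 6 a.m., 2 p.m. and 8 p.m. for 28 days.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
days = np.repeat(np.arange(1, 29), 3)                  # day index, three readings per day
values = rng.normal(loc=50, scale=2, size=days.size)
values[days % 7 == 6] += 15                            # spikes on days 6, 13, 20 and 27

df = pd.DataFrame({'day': days, 'value': values})
daily = df.groupby('day')['value'].mean()              # daily averages keep the spikes
weekly = daily.groupby((daily.index - 1) // 7).mean()  # weekly averages smooth them away

print(daily.loc[[5, 6, 7]].round(1))   # day 6 clearly stands out
print(weekly.round(1))                 # the four weekly values look nearly identical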
",Here's a quick and handy guide to creating data visualizations that are appropriately detailed to ensure maximum effectiveness.,Data visualization playbook: The right level of detail,Live,36 102,"Glynn Bird, Developer Advocate @ IBM Watson Data Platform CREATE A CUSTOM DOMAIN FOR CLOUDANT USING CLOUDFLARE WHAT'S IN A NAME? PROXY TO GET SPEED AND PROTECTION TOO. When signing up for an IBM Cloudant account through cloudant.com, you pick a username, which becomes the sub-domain of cloudant.com, e.g. janedoe.cloudant.com. If you create a Cloudant service inside Bluemix, then you are assigned a randomly-generated sub-domain like dd4f-de8e79e7--9652-4d92-fd347be5b308-bluemix.cloudant.com. If you want to assign a custom domain to your Cloudant account, you could perform the DNS magic yourself, but it would leave you with the responsibility of creating an HTTPS certificate for your domain. A much simpler alternative is to sign up for a Cloudflare account and let them do the heavy lifting! Cloudflare is a proxy service that sits between your users and your website, handling caching, immunity to denial-of-service attacks, analytics, content optimisation, and lots more. In this case, we are going to place Cloudflare in front of a Cloudant account. This article assumes you have your own custom domain name already (like janedoe.com) and have already signed up for a Cloudant account (like janedoe.cloudant.com). We want to create a new sub-domain, db.janedoe.com, which will work with HTTPS and whose traffic will be sent to Cloudant. SIGN UP FOR CLOUDFLARE Visit www.cloudflare.com and create an account. Enter your custom domain name and let Cloudflare perform its initial scan. ADD A CNAME RECORD Once the Cloudflare scan of your existing domain is complete, we can tell Cloudflare that we wish to proxy db.janedoe.com to janedoe.cloudant.com. To do this, we create a CNAME record by completing the form: here, we choose the CNAME type from the pull-down list and enter the new sub-domain ( db ) and our target ( janedoe.cloudant.com ).
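As a quick sanity check (an aside, not part of the original walkthrough), the new sub-domain should now resolve from Python; because Cloudflare proxies the traffic, the address returned is typically a Cloudflare edge IP rather than Cloudant's own:

# Check that the example sub-domain created above now resolves.
# With Cloudflare proxying enabled, expect a Cloudflare edge address, not Cloudant's.
import socket

print(socket.gethostbyname('db.janedoe.com'))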
TELL CLOUDANT ABOUT YOUR DOMAIN NAME Cloudant also needs to know about this new naming strategy. In the Cloudant dashboard, select Account > Virtual Hosts and complete the form. Enter your new domain name ( db.janedoe.com ) and click the Add Domain button. TESTING After a few minutes, you should be able to visit http://db.janedoe.com and https://db.janedoe.com (HTTPS may take up to 24 hours to take effect). That's it! Note: If you bind your proxied Cloudant service to a Bluemix app, this mapping will not take effect because the VCAP_SERVICES entry for Cloudant will not reflect the new domain name. BENEFITS OF USING CLOUDFLARE AND CLOUDANT Cloudflare offers several benefits for Cloudant users: * HTTP2. Cloudflare supports HTTP2/SPDY out of the box, so requests from HTTP2-compatible sources (like Google Chrome) benefit from the smaller binary protocol, the single multiplexed connection, and the compressed headers that HTTP2 affords. * Free HTTPS. Your custom domain can be covered by a free HTTPS certificate without any fuss. * DDoS protection. If you are paying for a quota of Cloudant requests, then the last thing you want is for a bad actor to maliciously call your Cloudant account directly at your expense. * Compression. Traffic between the browser/user-agent and Cloudflare can be compressed, reducing the amount of bandwidth required to transmit or receive requests. * Caching. If you upgrade to a paid plan, you can customise Cloudflare to cache certain requests to improve performance or to take some load off your Cloudant service. * Analytics. You can see statistics on which URLs are being hit. CONCLUSION That's how easy it is to set up a custom domain for your Cloudant service. If you use Cloudant on Bluemix, the process is the same. (To reach your Cloudant dashboard from Bluemix, just open the service and click Launch.) Then follow the steps outlined in this post. Enjoy your new custom domain, along with all the benefits of Cloudflare. Tags: Cloudant, Cloudflare, Tutorial, Proxy, DNS","When you customise your Cloudant domain with Cloudflare, you get better performance, DDoS protection, and caching too. Here's how to set it up.",Create a Custom Domain for Cloudant Using Cloudflare – IBM Watson Data Lab,Live,37 105,"The primary index is the fastest way to retrieve data from your database. To demo the Cloudant API, you'll need to replicate a small sample database into your account. The database is named animaldb, and it contains information from Wikipedia about ten different animals. The primary index is fast because it comes with every Cloudant database, which means you don't have to write any code before you can use it. The primary index, often referred to as _all_docs, returns an id, a key and a value for every document in the database. The id and key are the same (Cloudant makes an index keyed by doc id), while the value is the _rev of the document. _all_docs also reports on the total number of documents and any offset used to query the index.
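As a concrete picture of what that primary index returns, here is a minimal Python sketch, not part of the original interactive tutorial, that issues the generic _all_docs request against the animaldb sample database; the account name and credentials are placeholders for your own:

# Minimal sketch: read the primary index of the animaldb sample database.
# 'janedoe' and 'secret' are placeholders for a real account name and password or API key.
import requests

account = 'janedoe'
url = 'https://{0}.cloudant.com/animaldb/_all_docs'.format(account)

resp = requests.get(url, auth=(account, 'secret'))
resp.raise_for_status()
body = resp.json()

print('total_rows:', body['total_rows'], 'offset:', body['offset'])
for row in body['rows'][:3]:
    # id and key are the document _id; value carries the current _rev.
    print(row['id'], row['value']['rev'])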
All indexes are sorted by their key. The full sort-order specification is documented in the CouchDB Wiki. The generic _all_docs request above returns all the documents in the database. That's fine for this example database, but in a realistic scenario you'll probably want a more manageable result set. That's where API options come in. Add the limit parameter to keep your result set to a certain size. If you want to offset your result set (for example to paginate through some rows) you can also pass in a skip parameter. In this call, we limit the result set to 2 rows and skip the first 3 rows. Use slicing to pull out row ranges from the index by using start and end keys in your query. Here we are looking for animals with names that begin with letters greater than the startkey up to and including the endkey. If you don't want to include documents that match the end key, add the inclusive_end parameter with a value of false. View slicing with startkey and endkey can be combined with skip, limit and inclusive_end to further constrain your result set. Cloudant's primary index automatically turns a document's _id into its key. If you want a document matching a single key, find it with the key parameter. Here, we're looking for a document indexed with the key of ""llama"". You can also hit the document directly, without additional parameters, at its unique URL. The result is similar to the single key request we made above, but different in that all fields are included in the result. Use include_docs=true when you want all of the contents of the document you're requesting (not just the id). This API call uses include_docs=true along with limit and skip. You can also query for a specific set of keys by POSTing a JSON array of keys to the view. As we've seen, the _all_docs index can be a very useful view into your database, especially if your application has a natural unique identifier that you can use for your documents. As your data grows, you'll want to explore secondary indexes, which allow you to build additional indexes over your database, defined by efficient MapReduce views.","A guide to using Cloudant's _all_docs endpoint to retrieve documents by id, or within a range of keys using this interactive tutorial.",For Developers: Querying the Cloudant Primary Index,Live,38 106,"REPRODUCIBLE FINANCE WITH R: PULLING AND DISPLAYING ETF DATA by Jonathan Regenstein It's the holiday season, and that can mean only one thing: time to build a leaflet map as an interface to country Exchange Traded Fund ( ETF ) data! In previous posts, we examined how to import stock data and then calculate and display the Sharpe Ratio of a portfolio. Today, we're going to skip the calculations and focus on a nice interface for pulling and displaying data. Specifically, our end product will enable users to graph country ETF prices by clicking on those countries in an interactive map, instead of having to use the ETF ticker symbol. Admittedly, part of the motivation here is that I don't like having to remember ticker symbols for country ETFs, but hopefully others will find it useful too.
Our app will be simple in that it displays price histories, but it can serve as the foundation for more complicated work, as we will discuss when the app is completed in the next post. At the outset, it is crucial to note that this Notebook will serve a different purpose than our previous Notebook. As before, we will use this Notebook to test data import, wrangling, and our visualizations before taking the next step of building an interactive Shiny app. However, we are going to save objects from this Notebook into a .Rdat file, and then use that file in our app. In that way, this Notebook is more fundamentally connected to our app than our previous Notebook. In the next “finance Friday = fun day” post, we will go through how to build that app (though frankly the hard work occurs in this Notebook), but for today here is how we’ll proceed. First, we will get our ETF tickers, countries and year-to-date performance data into a nice, neat data frame. Note that the data frame will not hold the price history data itself. Rather, it will hold simply the ticker symbols, country names and YTD percentages. Next, we pass those ticker symbols to the getSymbols() function and download the price histories for the county ETFs. Advance warning: there are 42 country ETFs in this example, and downloading 42 xts objects takes time and RAM. I recommend using the server version of the IDE if you want to run this code, or truncate and grab three or four price histories, or skip this step. As we’ll see, it is not strictly necessary to pass all of those tickers to getSymbols() right now because the data will be downloaded on the fly when a user clicks on a country in our Shiny app. However, even though it requires a lot memory, I prefer to download all 42 price histories in order to confirm that the tickers are correct and accessible via getSymbols() . Better to find the typos now than to have users discover an error in the app. Once we have confirmed that our ticker symbols are valid, it’s time for step 3: build our map using a shapefile of the world’s countries. This step requires a lot of RAM, but leaflet makes the process quite simple from a coding perspective. If you’re new to map building, this will serve as a gentle introduction to creating a usable interactive map. Fourth, and very importantly, we will add our ETF tickers and year-to-date performance data to our shapefile, making them accessible via clicks on the map. At this step, we will be thankful that when we created a data frame in step 1, we used the same country names as appear on the map: that forethought will allow us to do an easy merge() of the data. We’ll then build the map to make sure it looks how we want it to look in the final app. Once we have a shapefile with our ETF tickers added, we’ll save it to a .RDat file that we can load into our Shiny app. Let’s get to it! Building an interface to country ETFs will require those ETF ticker symbols. We also need the country names to go alongside them. Why country names instead of, say, the full ETF title? We need a way to synchronize with our map file and country names is a good way. There’s no way to know this ahead of time without thinking through the structure of the app and probably making liberal use of a whiteboard. That valuable country ETF data is available here . Have a peek at that link and notice that the year-to-date performance is also readily available. I hadn’t planned on including YTD performance in any way, but we’ll grab it and put it to good use. 
That data is not available in the html, so simple rvest moves aren’t going to help us. There’s a download button, but I found it easier to copy/paste to a spreadsheet and then import to the IDE. I will spare us the gsub() pain of extracting country names from the fund titles (though direct message me if you want that code) and paste the tickers, country names and year-to-date performance below. The data frame looks pretty good, though quite simple, and it’s fair to wonder why I bothered to highlight this step with it’s own code chunk. In fact, getting the clean ticker and country names was quite time-consuming, and that will often be the case: the most prosaic data import and tidying tasks can take a long time! Here is another fine occasion to bring up reproducibility and work flow. Once you or your colleague has spent the time to get a clean data frame with ticker and country names, we definitely want to make sure that no one else, including your future self, has to duplicate the effort for a future project. I put this step in it’s own code chunk so that the path back to the clean data would be as clear as possible. For that reason, I also have a personal preference for the ‘DataGrab’ file naming convention – i.e., in the IDE, I named this file ‘Global-ETF-Map-DataGrab’. Whenever I use a Notebook for the purpose of importing, tidying, building and then saving objects in a .Rdat file that will be loaded by a Shiny app, I include ‘DataGrab’ in the name of the file. If future me or a team member needs to locate the file behind one of our flexdashboards, they will know that it has ‘DataGrab’ in the title. Back to the code at hand! Now that we have the tickers in a data frame column, we can use getSymbols() to import the price history of each ETF. We aren’t going to use the results of this import in the app. Rather, we are going to perform this import to test that we have the correct symbols, and that they play nicely with getSymbols() , because that is the function we will use in our Shiny app. Alright, it looks like we’ve been successful at importing the closing price history of the country ETFs. Nothing too complicated here and again, our purpose was to test that the ticker symbols are correct. We are not going to be saving these prices for future use. Now it’s time to build a map of the Earth! First, we will need a shapefile that contains the spatial polygons for the countries of the world. The next code chunk will grab a shapefile from naturalearthdata.com . That shapefile has the longitude and latitude coordinates for the world’s countries and some data about them. We’ll then use the readOGR() function from the rgdal package to load the shapefile into our global environment. Take a peek at the data frame portion of the shapefile, and scroll to the right to see some interesting things like GDP estimates and economic development stages. It’s pretty nice that the shapefile contains some economic data for us. The other portion of the shapefile is the spatial data: longitude and latitude coordinates. If you’re not a cartographer, don’t worry about those for now. If you’re not familiar with spatial data frames, that’s okay because neither am I. The leaflet package makes building a nice interactive map with these shapefiles relatively painless. Before building a map, let’s make use of the data that was included in our data frame. The ‘gpd_md_est’ column (which you can see in the data frame above) contains GDP estimates for each country. 
We’ll add some color to our map with shades of blue that are darker for higher GDPs and lighter for lower GDPs. We want something to happen when a user clicks a country. How about a popup with country name and stage of economic development? Again, that data is included in the shapefile we downloaded. Now we can use leaflet to build a world map that is shaded by GDP and displays a popup. Note the ‘layerId = ~name’ snippet below – it creates a layer of country names. We will change that later in an important way. The map looks good, but it sure would be nice if we could add the ETF ticker symbols and year-to-date data to the world spatial data frame object – and we can! Our ‘name’ column in the ETF data frame uses the same country naming convention as the ‘name’ column of the map, and those columns are both called ‘name’. Thus, we can use the merge() function from the sp package to add the ETF data frame to the spatial data frame. This is similar to a join using dplyr. The correspondence of country names wasn’t just luck – I had the benefit of having worked with this shapefile in the past, and made sure the country names matched up, and now you have the benefit of having worked with this shapefile. For any future project that incorporates a map like this, give some forethought to how data might need to be merged with the shapefile. The shapefile and the new data need a way to be matched. Country names usually work well. After the merging, the ticker symbols and year-to-date number columns will be added for each country that has a match in the ‘name’ column. For those with no match, the ‘ticker’ and ‘ytd’ columns will be filled with NA. Now that the ytd data is added, let’s shade the different countries according to the year-to-date performance of the country EFT, instead of by GDP as we did before. A nice side benefit of this new shading scheme: if a country has no ETF, it will remain an unattractive grey. The new shading is nice, but let’s also have the popup display the exact year-to-date performance percentage for any detail-oriented users. Now we’ll build another map that uses the year-to-date color scheme and popup, but we will make one more massively important change: we will change layerId = ~name to layerId = ~ticker to create a map layer of tickers. Why is this massively important? When we eventually create a Shiny app, we want to pass ticker symbols to getSymbols() based on a user click. The ‘layerId’ is how we’ll do that: when a user clicks on a country, we capture the ‘layerId’, which is a ticker name that we can pass to getSymbols() . But that is getting ahead of ourselves. For now, here is the new map: Fantastic: we have a map that is shaded by the YTD performance of country ETFs, and displays that YTD percentage in the popup. Notice the difference between this map and the previous map which was shaded by GDP: a user can quickly see which countries have ETFs and click to see more. The ‘world_etf’ shapefile is going to play a crucial role in our Shiny app, and the last step is to save it for use in our flexdashboard. Note that we are not going to save the ETF price data. It’s not needed in the interactive Shiny app because that data will be imported dynamically when a user clicks. That allows our dashboard to be constantly updated in real time. Remember that we loaded up the ETF data in this Notebook so that we could ensure that the ticker symbols play nicely with getSymbols() . 
Next time, we'll wrap this up into a Shiny app by way of flexdashboard, and that app will allow users to click on a country and graph the ETF history. The first thing we'll do in that file is load the .RDat file that we just created. There are two pieces of good news: first, we've already done the hard work of creating a map object, and the app coding is the fun part. Second, the work here does not need to be repeated for any future projects. If you or your team ever need to build a map of the world shaded by GDP estimates or ETF YTD performance, here it is. If you ever need the clean tickers, year-to-date performance or the time series data on these 42 country ETFs, here it is. See you soon! Jonathan Regenstein, 2016-12-14","Our app will be simple in that it displays price histories, but it can serve as the foundation for more complicated work, as we will discuss when the app is completed in the next post.
At the outset, it is crucial to note that this Notebook will serve a different purpose than our previous Notebook.",Pulling and Displaying ETF Data,Live,39 107,"Stats and Bots Follow Sign in / Sign up * Home * Subscribe * * 🤖 TRY STATSBOT FREE - Empower every department with data * Vadim Smolyakov Blocked Unblock Follow Following passionate about data science and machine learning https://github.com/vsmolyakov Aug 22 -------------------------------------------------------------------------------- ENSEMBLE LEARNING TO IMPROVE MACHINE LEARNING RESULTS HOW ENSEMBLE METHODS WORK: BAGGING, BOOSTING AND STACKING Ensemble learning helps improve machine learning results by combining several models. This approach allows the production of better predictive performance compared to a single model. That is why ensemble methods placed first in many prestigious machine learning competitions, such as the Netflix Competition, KDD 2009, and Kaggle. The Statsbot team wanted to give you the advantage of this approach and asked a data scientist, Vadim Smolyakov, to dive into three basic ensemble learning techniques. -------------------------------------------------------------------------------- Ensemble methods are meta-algorithms that combine several machine learning techniques into one predictive model in order to decrease variance (bagging), bias (boosting), or improve predictions (stacking). Ensemble methods can be divided into two groups: * sequential ensemble methods where the base learners are generated sequentially (e.g. AdaBoost). The basic motivation of sequential methods is to exploit the dependence between the base learners. The overall performance can be boosted by weighing previously mislabeled examples with higher weight. * parallel ensemble methods where the base learners are generated in parallel (e.g. Random Forest). The basic motivation of parallel methods is to exploit independence between the base learners since the error can be reduced dramatically by averaging. Most ensemble methods use a single base learning algorithm to produce homogeneous base learners, i.e. learners of the same type, leading to homogeneous ensembles . There are also some methods that use heterogeneous learners, i.e. learners of different types, leading to heterogeneous ensembles . In order for ensemble methods to be more accurate than any of its individual members, the base learners have to be as accurate as possible and as diverse as possible. BAGGING Bagging stands for bootstrap aggregation. One way to reduce the variance of an estimate is to average together multiple estimates. For example, we can train M different trees on different subsets of the data (chosen randomly with replacement) and compute the ensemble: Bagging uses bootstrap sampling to obtain the data subsets for training the base learners. For aggregating the outputs of base learners, bagging uses voting for classification and averaging for regression . We can study bagging in the context of classification on the Iris dataset. We can choose two base estimators: a decision tree and a k-NN classifier. Figure 1 shows the learned decision boundary of the base estimators as well as their bagging ensembles applied to the Iris dataset. Accuracy: 0.63 (+/- 0.02) [Decision Tree] Accuracy: 0.70 (+/- 0.02) [K-NN] Accuracy: 0.64 (+/- 0.01) [Bagging Tree] Accuracy: 0.59 (+/- 0.07) [Bagging K-NN] The decision tree shows the axes’ parallel boundaries, while the k=1 nearest neighbors fit closely to the data points. 
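For readers who want to reproduce the idea, here is a minimal scikit-learn sketch. It approximates the experiment described around this figure (the exact settings live in the notebook linked later in the article), using 10 base estimators with 0.8 subsampling of rows and features:

# Minimal sketch: bagging a decision tree and a k-NN classifier on the Iris dataset.
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

base_learners = [('Decision Tree', DecisionTreeClassifier()),
                 ('K-NN', KNeighborsClassifier(n_neighbors=1))]

for label, base in base_learners:
    bagged = BaggingClassifier(base, n_estimators=10,
                               max_samples=0.8, max_features=0.8)
    scores = cross_val_score(bagged, X, y, cv=5)
    print('Bagging %s accuracy: %.2f (+/- %.2f)' % (label, scores.mean(), scores.std()))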
The bagging ensembles were trained using 10 base estimators with 0.8 subsampling of training data and 0.8 subsampling of features. The decision tree bagging ensemble achieved higher accuracy in comparison to the k-NN bagging ensemble. K-NN are less sensitive to perturbation on training samples and therefore they are called stable learners. Combining stable learners is less advantageous since the ensemble will not help improve generalization performance.The figure also shows how the test accuracy improves with the size of the ensemble. Based on cross-validation results, we can see the accuracy increases until approximately 10 base estimators and then plateaus afterwards. Thus, adding base estimators beyond 10 only increases computational complexity without accuracy gains for the Iris dataset. We can also see the learning curves for the bagging tree ensemble. Notice an average error of 0.3 on the training data and a U-shaped error curve for the testing data. The smallest gap between training and test errors occurs at around 80% of the training set size. A commonly used class of ensemble algorithms are forests of randomized trees.In random forests , each tree in the ensemble is built from a sample drawn with replacement (i.e. a bootstrap sample) from the training set. In addition, instead of using all the features, a random subset of features is selected, further randomizing the tree. As a result, the bias of the forest increases slightly, but due to the averaging of less correlated trees, its variance decreases, resulting in an overall better model. In an extremely randomized trees algorithm randomness goes one step further: the splitting thresholds are randomized. Instead of looking for the most discriminative threshold, thresholds are drawn at random for each candidate feature and the best of these randomly-generated thresholds is picked as the splitting rule. This usually allows reduction of the variance of the model a bit more, at the expense of a slightly greater increase in bias. BOOSTING Boosting refers to a family of algorithms that are able to convert weak learners to strong learners. The main principle of boosting is to fit a sequence of weak learners− models that are only slightly better than random guessing, such as small decision trees− to weighted versions of the data. More weight is given to examples that were misclassified by earlier rounds. The predictions are then combined through a weighted majority vote (classification) or a weighted sum (regression) to produce the final prediction. The principal difference between boosting and the committee methods, such as bagging, is that base learners are trained in sequence on a weighted version of the data. The algorithm below describes the most widely used form of boosting algorithm called AdaBoost , which stands for adaptive boosting. We see that the first base classifier y1(x) is trained using weighting coefficients that are all equal. In subsequent boosting rounds, the weighting coefficients are increased for data points that are misclassified and decreased for data points that are correctly classified. The quantity epsilon represents a weighted error rate of each of the base classifiers. Therefore, the weighting coefficients alpha give greater weight to the more accurate classifiers. The AdaBoost algorithm is illustrated in the figure above. 
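As a rough sketch of that procedure, and an assumption about how the figure's experiment might be reproduced rather than the exact code behind it, scikit-learn's AdaBoostClassifier with depth-1 decision stumps looks like this:

# Minimal sketch: AdaBoost with decision stumps (depth-1 trees) as the weak learners.
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

stump = DecisionTreeClassifier(max_depth=1)        # a weak learner (decision stump)
ada = AdaBoostClassifier(stump, n_estimators=50,   # 50 boosting rounds
                         learning_rate=1.0)
scores = cross_val_score(ada, X, y, cv=5)
print('AdaBoost accuracy: %.2f (+/- %.2f)' % (scores.mean(), scores.std()))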
Each base learner consists of a decision tree with depth 1, thus classifying the data based on a feature threshold that partitions the space into two regions separated by a linear decision surface that is parallel to one of the axes. The figure also shows how the test accuracy improves with the size of the ensemble and the learning curves for training and testing data. Gradient Tree Boosting is a generalization of boosting to arbitrary differentiable loss functions. It can be used for both regression and classification problems. Gradient Boosting builds the model in a sequential way. At each stage the decision tree hm(x) is chosen to minimize a loss function L given the current model Fm-1(x): The algorithms for regression and classification differ in the type of loss function used. STACKING Stacking is an ensemble learning technique that combines multiple classification or regression models via a meta-classifier or a meta-regressor. The base level models are trained based on a complete training set, then the meta-model is trained on the outputs of the base level model as features. The base level often consists of different learning algorithms and therefore stacking ensembles are often heterogeneous. The algorithm below summarizes stacking. The following accuracy is visualized in the top right plot of the figure above: Accuracy: 0.91 (+/- 0.01) [KNN] Accuracy: 0.91 (+/- 0.06) [Random Forest] Accuracy: 0.92 (+/- 0.03) [Naive Bayes] Accuracy: 0.95 (+/- 0.03) [Stacking Classifier] The stacking ensemble is illustrated in the figure above. It consists of k-NN, Random Forest, and Naive Bayes base classifiers whose predictions are combined by Logistic Regression as a meta-classifier. We can see the blending of decision boundaries achieved by the stacking classifier. The figure also shows that stacking achieves higher accuracy than individual classifiers and based on learning curves, it shows no signs of overfitting. Stacking is a commonly used technique for winning the Kaggle data science competition. For example, the first place for the Otto Group Product Classification challenge was won by a stacking ensemble of over 30 models whose output was used as features for three meta-classifiers: XGBoost, Neural Network, and Adaboost. See the following link for details. CODE In order to view the code used to generate all figures, have a look at the following ipython notebook . CONCLUSION In addition to the methods studied in this article, it is common to use ensembles in deep learning by training diverse and accurate classifiers. Diversity can be achieved by varying architectures, hyper-parameter settings, and training techniques. Ensemble methods have been very successful in setting record performance on challenging datasets and are among the top winners of Kaggle data science competitions. RECOMMENDED READING * Zhi-Hua Zhou, “Ensemble Methods: Foundations and Algorithms”, CRC Press, 2012 * L. Kuncheva, “Combining Pattern Classifiers: Methods and Algorithms”, Wiley, 2004 * Kaggle Ensembling Guide * Scikit Learn Ensemble Guide * S. 
Raschka, MLxtend library * Kaggle Winning Ensemble",Ensemble learning helps improve machine learning results by combining several models. Ensemble methods allow the production of better predictive performance compared to a single model. ,Ensemble Learning to Improve Machine Learning Results,Live,40 109,"TL;DR: It's easy to customise the Mongo shell's prompt. If you use MongoDB shell with one of MongoHQ's Elastic Deployments, you will have noticed that your replica set name is not as catchy as it could be. You will have noticed it because that replica set name appears in your Mongo shell prompt by default. So your prompt looks something like this: set-5345738b13a3efb950000d32:PRIMARY> That's a bit noisy and probably not telling you much unless you have excellent skills memorising and comparing hex strings. It is also telling you whether you are connected to the primary or secondary. That's at least 37 characters, and you are nearly halfway across the screen before you start typing. What would probably be more useful is a shorter customised prompt that tells you what you want to know. Here we'll show you how in a couple of steps. The prompt in Mongo shell is derived from a variable called prompt; if it's not defined, then the shell shows us its default. The best place to set that is in your .mongorc.js file, but before we do that let's see what we can do with it in the shell. The first thing you might want to do is just strip down the prompt. To do that you can set prompt to a string: set-5345738b13a3efb950000d32:PRIMARY> var prompt="">"" If you aren't keen on the minimalism, just delete prompt to return to the default prompt. You can always set the variable to something more informative. Let's put the database name into the prompt. Of course, we have to remember that the value of the prompt, when set like that, is unchanging. So if we put the time in the prompt like so: exemplum> var prompt=ISODate().toLocaleTimeString()+"">"" we get a prompt of 14:53:38> but it would always be 14:53:38. To have a dynamic prompt, we need to set prompt to a function, so that evaluating it will make the function return a newly calculated value. 14:53:38> var prompt=function(){ return ISODate().toLocaleTimeString()+"">""; } 14:56:35> 14:56:38> Now the time will update when it displays the prompt. We're using the time here as it's the most easily accessible changing value for all users. It could be any other statistic you want to display, but do remember that the statistic will be calculated every time the prompt is displayed. We're already getting to the point where we want to make this more permanent, so exit the shell and open an editor on your .mongorc.js file. You'll find it in your $HOME directory on Unix systems.
We can set the prompt variable in there and create our own, ever smarter, version by adding this to .mongorc.js:

var prompt=function() {
  var dbname=db.getName();
  var master=db.isMaster().ismaster;
  var dblabel=master?dbname:""(""+dbname+"")"";
  var time=ISODate().toLocaleTimeString();
  return dblabel+""/""+time+"">"";
}

Here, the dbname is shown with parentheses around it if we are talking to a secondary node, and no parentheses if the primary, and the local time is added to our compact prompt. That gives us a neat prompt: exemplum/15:20:30> And that replica set name we hid at the start? If you need that, just run db.isMaster().setName in the shell. Now you can go and customise your prompt to what you need and get it displayed in the compact (or verbose) form you prefer. And when you've come up with your perfect custom prompt, why not share it with us by mailing it to dj@mongohq.com and we'll publish the best in a future article.","It's easy to customize the Mongo shell's prompt, especially if you use MongoDB shell with one of MongoHQ's Elastic Deployments.",Customizing MongoDB's Shell with Compact Prompts,Live,41 110,"GETTING STARTED WITH COMPOSE'S SCYLLADB Published Sep 22, 2016. Getting started with ScyllaDB is easy since it is a drop-in replacement for Apache's Cassandra database. For all intents and purposes, Scylla looks just like Cassandra to your code. So much so that Scylla even uses Cassandra's drivers. The main difference is in implementation: Scylla is written in C++ while Cassandra is written in Java. Compose's ScyllaDB is the latest version: Scylla 1.3. This version corresponds to Cassandra 2.1.8 with a detailed compatibility matrix here . One of the benefits of mimicking Cassandra is that the tool chain, drivers, and built-in query language, cql, are already mature since they have evolved through multiple iterations and a great deal of use. The number of drivers on Planet Cassandra, all of which are compatible, is far beyond a typical 1.x project. cql, a SQL-like language, has grown into being the de facto way to interact with Scylla/Cassandra, and it even has its own shell, cqlsh, similar to many SQL shells for RDBMSs. What follows is a brief run-through of some of the highlights of connecting to ScyllaDB on Compose. After creating a deployment, we will look at getting connected with cqlsh, then we will review connecting on the JVM, Python, and NodeJS runtimes to go over the basics of getting started. CONNECT WITH CQLSH Assuming you already have a Compose account (if not, you can get a 30 day free trial here ), creating a deployment of ScyllaDB is little more than hitting ""Create Deployment"" and choosing ""ScyllaDB"". After a couple of minutes, a three node cluster will have been created for you: The ""Overview"" page has all of the information needed to connect to your new Scylla cluster. The easiest way to verify your deployment and your tools is to connect directly with cqlsh. Depending on your platform there are multiple ways to get this tool onto your local device, whether that be your laptop, a cloud VM, or even your own dedicated hardware. The easiest is to just install the latest Cassandra release (the latest versions still support version 2.1.8, which is what Scylla is) and use the built-in cqlsh. On a Mac with homebrew, it is nothing more than brew install cassandra . For others there are myriad ways, from package managers to straight downloads. Use whatever suits your platform best.
From the ""Overview"" page it is easy to copy the cmd (any one of them will work): and then just paste it into your shell to execute it: If you type HELP you can see that the shell has a lot of capability. What's even nicer is that all of those commands have TAB completion too. Let's try it. Type CREATE KEYSPACE my_new_keyspace you should see the choices for the replication class. Go ahead and choose SimpleStrategy since the cluster won't be spanning multiple data centers. Hit again and enter in 3 for the replication_factor. Then close the brace with } and finish the statement with ; . You just created your first KEYSPACE and defaulted it to replicating your data to all three nodes in your cluster. Now that you have a keyspace let's use it: USE my_new_keyspace; Your shell will show that your command prompt is using your keyspace by default: Every table has to have a keyspace and when we create one in the shell here it will default to my_new_keyspace . While Scylla/Cassandra has evolved into having a schema language that looks very similar to SQL. It's not really the case. Unlike an RDBMS, a row here is much more like a key value lookup. It just so happens that the value has a flexible schema which we are about to define: CREATE TABLE my_new_table ( my_table_id uuid, last_name text, first_name text, PRIMARY KEY(my_table_id) ); Type that CREATE TABLE command in your cqlsh to give us a place to populate with the following examples. CONNECT FROM THE JVM One of the most advanced drivers for Cassandra is the Java driver. This makes sense considering Cassandra is written in Java. What follows is a Groovy script. For those who utilize just about any JVM language translating from Groovy to your language of choice should be relatively straightforward: @Grab('com.datastax.cassandra:cassandra-driver-core:3.1.0') @Grab('org.slf4j:slf4j-log4j12') import com.datastax.driver.core.BoundStatement import com.datastax.driver.core.Cluster import com.datastax.driver.core.Host import com.datastax.driver.core.PreparedStatement import com.datastax.driver.core.Row import com.datastax.driver.core.Session import static java.util.UUID.randomUUID Cluster cluster = Cluster.builder() .addContactPointsWithPorts( new InetSocketAddress(""aws-us-east-1-portal9.dblayer.com"", 15399 ), new InetSocketAddress(""aws-us-east-1-portal9.dblayer.com"", 15401 ), new InetSocketAddress(""aws-us-east-1-portal6.dblayer.com"", 15400 ) ) .withCredentials(""scylla"", ""XOEDTTBPZGYAZIQD"") .build() Session session = cluster.connect(""my_new_keyspace"") PreparedStatement myPreparedInsert = session.prepare( """"""INSERT INTO my_new_table(my_table_id, last_name, first_name) VALUES (?,?,?)"""""") BoundStatement myInsert = myPreparedInsert .bind(randomUUID(), ""Hutton"", ""Hays"") session.execute(myInsert) session.close() cluster.close() To get started we pull in the latest Cassandra driver: @Grab('com.datastax.cassandra:cassandra-driver-core:3.1.0') After all of the imports we use a Cluster.builder() to build up the configuration. Just one of the ContactPoint s is used to connect. From that connection the other nodes in the cluster are discovered. If that ContactPoint is unreachable on connect then another is used which is why we add all three. PreparedStatement s may be familiar since they are analogous to other DBs' features of the same name. The statement is parsed and held at the server ready to be used over and over again. The following calls to bind and execute populate and send the data over to the server for actual execution. 
While there are simpler methods for one off execution, it is good to highlight such a useful feature. To prove that the script works go back to your cqlsh and query the table: CONNECT FROM PYTHON Support for languages other than Java is very solid too. Python is a great example. cqlsh is even written in Python. So make no mistake the support here is more than up to date: pip install cassandra-driver The above pulls in the driver with a python package manager pip . The following performs very similarly to the Java code of preparing a statement and executing an insert: from cassandra.cluster import Cluster from cassandra.auth import PlainTextAuthProvider import uuid auth_provider = PlainTextAuthProvider( username='scylla', password='XOEDTTBPZGYAZIQD') cluster = Cluster( contact_points = [""aws-us-east-1-portal9.dblayer.com""], port = 15401, auth_provider = auth_provider) session = cluster.connect('my_new_keyspace') my_prepared_insert = session.prepare("""""" INSERT INTO my_new_table(my_table_id, first_name, last_name) VALUES (?, ?, ?)"""""") session.execute(my_prepared_insert, [uuid.uuid4(), 'Snake', 'Hutton']) To verify again we'll run the same SELECT : CONNECT FROM NODEJS Last but not least: Javascript. npm install cassandra-driver npm install uuid We use the ubiquitous node package manager (npm) to install the driver and the needed uuid library. The very similar code to the above examples follows: var cassandra = require('cassandra-driver') var authProvider = new cassandra.auth.PlainTextAuthProvider('scylla', 'XOEDTTBPZGYAZIQD') var uuid = require('uuid') client = new cassandra.Client({ contactPoints: [ ""aws-us-east-1-portal9.dblayer.com:15399"", ""aws-us-east-1-portal9.dblayer.com:15401"", ""aws-us-east-1-portal6.dblayer.com:15400"" ], keyspace: 'my_new_keyspace', authProvider: authProvider}); client.execute(""INSERT INTO my_new_table(my_table_id, first_name, last_name) VALUES(?,?,?)"", [uuid.v4(), ""V8"", ""Hutton""], { prepare: true }, function(err, result) { if(err) { console.error(err); } console.log(""success"") }); Once again we connect, prepare, and execute an insert statement. And finally we verify: MoreThere is so much more to ScyllaDB. Modelling data from queries first. User defined data types. Tunable consistency. Building databases without joins. Timestamps. Architecting an app with eventual consistency. CAP theorem. PACELC theorem. Dynamo and BigTable. On and on... The flexible availability guarantees of ScyllaDB/Cassandra really are a great tool and plumbing the depths of how to make them work well can take some time. We at Compose though are excited about ScyllaDB and look forward to seeing what you can do with such a great new database. Image by Margarida CSilva Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Hays Hutton is a spirit runner. Love this article? Head over to Hays Hutton’s author page and keep reading. Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Customer Stories Compose Webinars Support System Status Support & Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ ScyllaDB Enterprise Add-ons * Deployments AWS DigitalOcean SoftLayer Google Cloud * Subscribe Join 75,000 other devs on our weekly database newsletter. 
© 2016 Compose",Getting started with ScyllaDB is easy since it is a drop in replacement for Apache's Cassandra database.,Getting Started with Compose's ScyllaDB,Live,42 115,"☰ * Login * Sign Up * Learning Paths * Courses * Our Courses * Partner Courses * Badges * Our Badges * BDU Badge Program * Events * Blog * Resources * Resources List * Downloads * DEEP LEARNING WITH TENSORFLOW The majority of data in the world is unlabeled and unstructured. Shallow neural networks cannot easily capture relevant structure in, for instance, images, sound, and textual data. Deep networks are capable of discovering hidden structures within this type of data. In this TensorFlow course you'll use Google's library to apply deep learning to different data types in order to solve real world problems. Login to EnrollTELL YOUR FRIENDS * * * * * Course code: ML0120EN * Audience: Anyone interested in Machine Learning, Deep Leaning and TensorFlow * Course level: Advanced * Time to complete: 10 Hours * Learning path: Deep Learning This Deep Learning with TensorFlow course focuses on TensorFlow. If you are new to the subject of deep learning, consider taking our Deep Learning 101 course first. Traditional neural networks rely on shallow nets, composed of one input, one hidden layer and one output layer. Deep-learning networks are distinguished from these ordinary neural networks having more hidden layers, or so-called more depth. These kind of nets are capable of discovering hidden structures within unlabeled and unstructured data (i.e. images, sound, and text), which consitutes the vast majority of data in the world. TensorFlow is one of the best libraries to implement deep learning. TensorFlow is a software library for numerical computation of mathematical expressional, using data flow graphs. Nodes in the graph represent mathematical operations, while the edges represent the multidimensional data arrays (tensors) that flow between them. It was created by Google and tailored for Machine Learning. In fact, it is being widely used to develop solutions with Deep Learning. In this TensorFlow course, you will be able to learn the basic concepts of TensorFlow, the main functions, operations and the execution pipeline. Starting with a simple “Hello Word” example, throughout the course you will be able to see how TensorFlow can be used in curve fitting, regression, classification and minimization of error functions. This concept is then explored in the Deep Learning world. You will learn how to apply TensorFlow for backpropagation to tune the weights and biases while the Neural Networks are being trained. Finally, the course covers different types of Deep Architectures, such as Convolutional Networks, Recurrent Networks and Autoencoders. Course Syllabus Module 1 – Introduction to TensorFlow * HelloWorld with TensorFlow * Linear Regression * Nonlinear Regression * Logistic Regression * Activation Functions Module 2 – Convolutional Neural Networks (CNN) * CNN History * Understanding CNNs * CNN Application Module 3 – Recurrent Neural Networks (RNN) * Intro to RNN Model * Long Short-Term memory (LSTM) * Recursive Neural Tensor Network Theory * Recurrent Neural Network Model Module 4 - Unsupervised Learning * Applications of Unsupervised Learning * Restricted Boltzmann Machine * Collaborative Filtering with RBM Module 5 - Autoencoders * Introduction to Autoencoders and Applications * Autoencoders * Deep Belief Network GENERAL INFORMATION * This TensorFlow course is free. * This course if with Python language. * It is self-paced. 
* It can be taken at any time. * It can be audited as many times as you wish. RECOMMENDED SKILLS PRIOR TO TAKING THIS COURSE * Neural Network REQUIREMENTS * Python programming COURSE STAFF Saeed Aghabozorgi , PhD is a Data Scientist in IBM with a track record of developing enterprise level applications that substantially increases clients’ ability to turn data into actionable knowledge. He is a researcher in data mining field and expert in developing advanced analytic methods like machine learning and statistical modelling on large datasets. BIG DATA UNIVERSITY COURSE DEVELOPMENT TEAM Thanks to BDU course developement team, BDU interns and all individuals contributed to the development of this course: Kiran Mantri, Shashibushan Yenkanchi, Jag Rangrej, Naresh Vempala, Walter Gomes, Anita Vincent, Gabriel Sousa, Francisco Magioli, Victor Costa, Erich Sato, Luis Otavio and Rafael Belo. * About * Contact * Blog * Community * FAQ * Ambassador Program * Legal Follow us * * * * * * *",This free Deep Learning with TensorFlow course provides a solid introduction to the use of TensorFlow to analyze unstructured data.,Deep Learning With Tensorflow Course by Big Data University,Live,43 116,"Skip to main content IBM developerWorks / Developer Centers Sign In | Register Cloud Data Services * Services * How-Tos * Blog * Events * ConnectUNCOVER INSIGHTS ABOUT YOUR PRODUCTS HIDDEN IN STACK OVERFLOWmarkwatson / April 25, 2016As developer advocates one of our jobs is to help developers who areexperiencing issues with our products. Most developers turn to Stack Overflow to ask questions when they run into trouble (over 11.5 million questions askedto date!). We constantly monitor Stack Overflow for questions related to ourproducts, or our personal expertise, to provide as much assistance to developersas possible. We answer a lot of questions, and it’s important that we track andanalyze those questions.How do we conduct our Stack Overflow analysis? In this post we are going to showyou how to extend the Stack Overflow connector to provide real value and solvereal problems. We’ll show you how we use it to monitor the products we support,improve our responsiveness, and most importantly help our fellow developers.With 11,000,000+ questions, getting relevant Stack Overflow insights on aproduct is a challenge. We’ll show you how we do it, with our open source SimpleData Pipe app.THE STACK OVERFLOW CONNECTORIn this tutorial we showed you how to build a Simple Data Pipe Connector for Stack Overflow. Theend result was a connector that allowed users to select one of the top 30 mostactive tags on Stack Overflow and retrieve the 30 most active questions for thattag. While the tutorial served its purpose as a gentle introduction to Data PipeConnector development, we really didn’t create a connector that was all thatuseful.In this post we will show you how to extend the Stack Overflow connector to movedata that will enable us to: * Find questions that we need to answer. * Find out which of our products are most popular on Stack Overflow. * Run statistics to determine response rate, acceptance rate, etc.THE SIMPLE DATA PIPE SDKReflecting on the Stack Overflow connector we built in the previous tutorial,it’s easy to see where it was lacking: * We needed to be able to select less popular and more relevant tags, such as cloudant or apache-spark . * We needed to be able to pull more than 30 questions. * We needed to pull in the questions and the answers to those questions.It was obvious we needed to extend our connector. 
The Simple Data Pipe SDKallows us to extend almost every part of the connector, including: * Adding custom properties to the connector configuration. * Customizing the user interface for managing and running the connector. * Massaging or enhancing the data moved from the connector into Cloudant.ADDING CUSTOM PROPERTIESEvery pipe created in the Simple Data Pipe has a correspending document storedin the pipe_db database in Cloudant. This document contains information about the type of pipe(i.e., stackoverflow) and the configuration specific to that pipe. Here is asample document stored for a Stack Overflow pipe:{ ""_id"": ""fd1ffa968a467f73ce93d2a4720fdec4"", ""_rev"": ""28-666e0c198da190afafa275230e935a05"", ""connectorId"": ""stackoverflow"", ""name"": ""stackoverflow-html-tag"", ""type"": ""pipe"", ""version"": 1, ""clientId"": ""6812"", ""clientSecret"": ""ShxD2WxxxxxxSHxxJExX5x(("", ""oAuth"": { ""accessToken"": ""(R38xxxxC8WxxxPMN*Sp8Q))"" }, ""tables"": [ { ""name"": ""javascript"", ""label"": ""javascript"" }, { ""name"": ""java"", ""label"": ""java"" }, // ... { ""name"": ""html"", ""label"": ""html"" } ], ""selectedTableName"": ""html"", ""selectedTableId"": ""html""}The Simple Data Pipe SDK allows connector developers to add and access customproperties on this document. Developers can use those properties in code to makedecisions on how to retrieve data from the desired data source.To get the data that we need from Stack Overflow we are going to add three newproperties: * customTags : A comma-separated list of tags for questions that should be downloaded from Stack Overflow. * questionCount : The number of questions to download for each tag. * downloadAnswers : A boolean value specifying whether or not to download the answers for all tags.EXTENDING THE USER INTERFACEIn order to provide users the ability to specify custom values for our three newproperties ( customTags , questionCount , and downloadAnswers ) we need to make some changes to the user interface.We are going to customize the Filter page by adding a text field for users toenter the list of custom tags. This will populate our customTags property. We’ll add a pulldown with a list of paging options to populate our questionCount property. Finally, we’ll add a checkbox that will allow a user to specifywhether or not to retrieve the answers. This will set our downloadAnswers property.We start by copying the pipeDetails.tables.html page from the simple-data-pipeproject into the simple-data-pipe-stackoverflow project ( simple-data-pipe/app.templates simple-data-pipe-connector-stackoverflow/lib/templates ). We then add the following HTML:
[Filter page form markup stripped during extraction: a text field for the custom tags, a pulldown of paging options for the question count, and a ""Download Answers"" checkbox]
Our new Filter page looks like this: When a user saves their filter options we can see the three new properties added to the pipe config document in the database: { ""_id"": ""8237fa1bd2ea945cee7f89f71c1fa112"", ""_rev"": ""98-694fb538c0a10d2245658cb90c5e6c1c"", ""connectorId"": ""stackoverflow"", ... ""customTags"": ""apache-spark,cloudant,dashdb"", ""questionCount"": ""500"", ""downloadAnswers"": true } Now that we have these three properties available to us, we need to use them in our connector code. The extent of the changes is too great for this post, but we can see that these properties can be easily accessed from the pipe object passed into many of the connector functions, for example: this.fetchRecords = function(dataSet, pushRecordFn, done, pipeRunStep, pipeRunStats, pipeRunLog, pipe, pipeRunner) { var tags = pipe.customTags; var pageSize = pipe.questionCount; var downloadAnswers = pipe.downloadAnswers; //... } THE STACK OVERFLOW QUESTION DATA STRUCTURE After we update our code to use these properties and run our pipe, we can see the questions moved to Cloudant. Here is a sample question: { ""_id"": ""0506c2a366b431fbbdf939f4aae574a3"", ""_rev"": ""1-27c2ab3a997fa0cd73fd5f3cfe0168f4"", ""tags"": [ ""java"", ""nosql"", ""cloudant"" ], ""owner"": { ""user_id"": 3052176, ... }, ""is_answered"": false, ""answer_count"": 1, ... ""question_id"": 29216049, ""title"": ""Updating Cloudant database using Java"", ""body"": ""Was wondering if it possible to write code in Java that will update the entries in my Cloudant database?"", ""answers"": [ { ""owner"": { ""user_id"": 4284412, ... }, ""is_accepted"": false, ""question_id"": 29216049, ""body"": ""Yes, Its possible to write JAVA code to update entries / documents in Cloudant database. You need to use the java-cloudant driver. Please have a look at the following project on github."" ... } ], ... } As you can see, we are now retrieving and associating answers with questions. We've also highlighted a few other important fields: * tags : The tags associated with the question. * is_answered : A boolean specifying whether or not an answer has been accepted by the user who asked the question. * answer_count : The number of answers to the question. In the next section we'll use these fields to create custom queries to find the data that we need to gain greater insight into our Stack Overflow developer community. QUERYING AND ANALYZING THE STACK OVERFLOW DATA We are going to start by creating a new design document in Cloudant that will allow us to aggregate and search our Stack Overflow data. Specifically, we will create views and search indexes to: * Get the number of questions for a tag that have or have not been answered. * Get the number of questions for a tag that have or have not been accepted (by the owner of the question). * Get a list of questions for a tag that have no answers. We rolled up these views and indexes into a single design document: { ""_id"": ""_design/questions"", ""views"": { ""by_tag"": { ""reduce"": ""_sum"", ""map"": ""function (doc) {\n if (doc.tags) {\n for (var i=0; i<doc.tags.length; i++) {\n emit(doc.tags[i], 1);\n }\n }\n}"" }, ""by_tag_accepted"": { ""reduce"": ""_sum"", ""map"": ""function (doc) {\n if (doc.is_answered && doc.tags) {\n for (var i=0; i<doc.tags.length; i++) {\n emit(doc.tags[i], 1);\n }\n }\n}"" }, ""by_tag_not_accepted"": { ""reduce"": ""_sum"", ""map"": ""function (doc) {\n if (! doc.is_answered && doc.tags) {\n for (var i=0; i<doc.tags.length; i++) {\n emit(doc.tags[i], 1);\n }\n }\n}"" }, ""by_tag_answered"": { ""reduce"": ""_sum"", ""map"": ""function (doc) {\n if (doc.answer_count && doc.answer_count > 0 && doc.tags) {\n for (var i=0; i<doc.tags.length; i++) {\n emit(doc.tags[i], 1);\n }\n }\n}"" }, ""by_tag_not_answered"": { ""reduce"": ""_sum"", ""map"": ""function (doc) {\n if (! doc.is_answered && doc.answer_count == 0 && doc.tags) {\n for (var i=0; i<doc.tags.length; i++) {\n emit(doc.tags[i], 1);\n }\n }\n}"" } }, ""language"": ""javascript"", ""indexes"": { ""by_tag"": { ""analyzer"": ""standard"", ""index"": ""function (doc) {\n if (doc.tags && doc.tags.length > 0) {\n for (var i=0; i<doc.tags.length; i++) {\n index(\""tag\"", doc.tags[i]);\n index(\""answered\"", doc.answer_count > 0);\n }\n }\n}"" } } } We'll use the following views to query statistics: * questions/by_tag : This will return the total number of questions for a tag. * questions/by_tag_answered : This will return the total number of answered questions for a tag. * questions/by_tag_not_answered : This will return the total number of questions that have not been answered for a tag. * questions/by_tag_accepted : This will return the total number of accepted questions for a tag. * questions/by_tag_not_accepted : This will return the total number of questions that have not been accepted for a tag. The first thing we are going to look at is the total number of questions for tags apache-spark , cloudant , and dashdb . We'll do this by querying the questions/by_tag view. For the cloudant tag this query would look something like this: curl -X GET https://$USERNAME:$PASSWORD@$USERNAME.cloudant.com/stackoverflow_custom/_design/questions/_view/by_tag?group=true&key=%22cloudant%22 Example response: {""rows"":[ {""key"":""cloudant"",""value"":476}]} There have been 476 questions labeled with the tag cloudant . If we run the same query for apache-spark and dashdb we can see which product is the most popular on Stack Overflow:
Tag            # Questions
apache-spark   12,521
cloudant       476
dashdb         58
Let's see how well these products are being supported by querying the questions/by_tag_answered view. curl -X GET https://$USERNAME:$PASSWORD@$USERNAME.cloudant.com/stackoverflow_custom/_design/questions/_view/by_tag_answered?group=true&key=%22cloudant%22 Example response: {""rows"":[ {""key"":""cloudant"",""value"":427}]} 427 of the 476 questions labeled with the tag cloudant have been answered. We can also use the questions/by_tag_accepted view to find how many questions have been accepted. Here are the results for all of our three tags:
Tag            # Questions   # Answered   % Answered   # Accepted   % Accepted
apache-spark   12,521        9,617        76.8         7,376        58.9
cloudant       476           427          89.7         339          71.2
dashdb         58            53           91.4         39           67.2
As you can see, around 90% of questions tagged with cloudant or dashdb have been answered, while over 23% of questions tagged with apache-spark have gone unanswered. So, let's see if we can find a few of these questions and start answering them. In the design document we created the following search indexes: * questions/by_tag : This will return all of the questions that have a tag that matches our query. * questions/by_tag_answer_status : This will return all of the questions that have a tag that matches our query and match our answered parameter. We can query the questions/by_tag_answer_status index passing in the tag and the answered: parameter set to false , as follows: curl -X GET https://$USERNAME:$PASSWORD@$USERNAME.cloudant.com/stackoverflow_custom/_design/questions/_search/by_tag_answer_status?q=tag:%22apache-spark%22+AND+answered:false&include_docs=true&limit=2 In this example we have limited our search results to two. The result is two questions without answers : { ""total_rows"":2904, ... ""rows"":[ { ""id"":""f74f323a1c531ef4c5ef6faf3fe2e074"", ""order"":[ 3.3885726928710938, 6 ], ""fields"":{}, ""doc"":{ ""_id"":""f74f323a1c531ef4c5ef6faf3fe2e074"", ""_rev"":""1-5c6a960c4457a7382cbb0729c0844137"", ""tags"":[ ""apache-spark"" ], ""owner"":{ ""reputation"":24, ""user_id"":1935652, ...
}, ""is_answered"":false, ""view_count"":3, ""answer_count"":0, ""score"":1, ""last_activity_date"":1460620372, ""creation_date"":1460620372, ""question_id"":36616897, ""link"":""http://stackoverflow.com/questions/36616897/task-data-locality-no-pref-when-is-it-used"", ""title"":""Task data locality NO_PREF. When is it used?"", ""body"":""According to Spark doc, there are 5 levels of data locality..."", ... } }, { ""id"":""d173ca7647eac111020df96c264137bc"", ""order"":[ 3.241180419921875, 26 ], ""fields"":{}, ""doc"":{ ... ""tags"":[ ""apache-spark"" ], ""owner"":{ ""reputation"":143, ""user_id"":5245972, ... }, ""is_answered"":false, ""view_count"":12, ""answer_count"":0, ""score"":0, ""last_activity_date"":1460378894, ""creation_date"":1460378894, ""question_id"":36549142, ""link"":""http://stackoverflow.com/questions/36549142/can-i-use-checkpoint-for-spark-in-this-way"", ""title"":""Can I use checkpoint for Spark in this way?"", ""body"":""The spark doc said about checkpoint..."", ... } } ]}From here we can copy the link for a question, go to the Stack Overflow site,and try to help out another developer in need of assistance.CONCLUSION AND NEXT STEPSUsing the Simple Data Pipe SDK to extend our Stack Overflow connector, we havebeen able to gain real insights into how we support developers. We did this byextending the user interface of our basic Stack Overflow connector to give usthe ability to choose more relevant data to download. We added new properties toour connector config that we were able to access immediately in code and withoutdatabase schema changes. Finally, we created views and search indexes inCloudant to retrieve important statistics and unanswered questions quickly andefficiently.We’ve barely scratched the surface with what we can do with this data. Here aresome potential next steps: * Create a dashboard for viewing and sharing these statistics. * Create an interface for searching previous answers or unanswered questions. * Integrate user information to find the users in our group who are answering the most questions, have the highest % of accepted questions, etc.You can access the Stack Overflow connector on github at https://github.com/ibm-cds-labs/simple-data-pipe-connector-stackoverflow .For more information about the Simple Data Pipe and Simple Data Pipe connectors start here .SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Geospatial * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM PrivacyIBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! 
",Learn how to use IBM Bluemix and the Simple Data Pipe example app to conduct Stack Overflow analysis on how well users support certain tech products.,Uncover Product Insights Hidden in Stack Overflow,Live,44 120,"Build a custom library for Apache® Spark™ and deploy it to a Jupyter Notebook. New to developing applications with Apache® Spark™? This is the tutorial for you. It provides the end-to-end steps needed to build a simple custom library for Apache® Spark™ (written in scala ) and shows how to deploy it on IBM Analytics for Apache Spark for Bluemix , giving you the foundation you need to build real-life production applications. In this tutorial, you'll learn how to: 1. Create a new Scala project using sbt and package it as a deployable jar. 2. Deploy the jar into a Jupyter Notebook on Bluemix. 3. Call the helper functions from a Notebook cell. 4. Optional: Import, test and debug your project in Scala IDE for Eclipse. REQUIREMENTS To complete these steps you need to: * be familiar with the scala language and jupyter notebooks . * download scala runtime 2.10.4 . * download homebrew . * download scala sbt (simple build tool). CREATE A SCALA PROJECT USING SBT There are multiple build frameworks you can use to build Apache® Spark™ projects. For example, Maven is popular with enterprise-build engineers. For this tutorial, we chose SBT because setup is fast and it's easy to work with. The following steps guide you through creation of a new project. Or, you can directly download the code from this Github repository : 1. Open a terminal or command line window. cd to the directory that contains your development project and create a directory named helloSpark : mkdir helloSpark && cd helloSpark 2. Create the recommended directory layout for projects built by Maven or SBT by entering these 3 commands: mkdir -p src/main/scala mkdir -p src/main/java mkdir -p src/main/resources 3. In the src/main/scala directory, create a subdirectory that corresponds to the package of your choice, like mkdir -p com/ibm/cds/spark/samples . Then, in that directory, create a new file called HelloSpark.scala and in your favorite editor, add the following content to it: package com.ibm.cds.spark.samples import org.apache.spark._ object HelloSpark { //main method invoked when running as a standalone Spark Application def main(args: Array[String]) { val conf = new SparkConf().setAppName(""Hello Spark"") val spark = new SparkContext(conf) println(""Hello Spark Demo. Compute the mean and variance of a collection"") val stats = computeStatsForCollection(spark) println("">>> Results: "") println("">>>>>>>Mean: "" + stats._1) println("">>>>>>>Variance: "" + stats._2) spark.stop() } //Helper method that builds a collection and returns its mean and variance def computeStatsForCollection( spark: SparkContext, countPerPartitions: Int = 100000, partitions: Int = 5 ): (Double, Double) = { val rdd = spark.parallelize( 1 until countPerPartitions * partitions, partitions ) (rdd.mean(), rdd.variance()) } } 4. Create your sbt build definition. To do so, in your project root directory, create a file called build.sbt and add the following code to it: name := ""helloSpark"" version := ""1.0"" scalaVersion := ""2.10.4"" libraryDependencies ++= { val sparkVersion = ""1.3.1"" Seq( ""org.apache.spark"" %% ""spark-core"" % sparkVersion, ""org.apache.spark"" %% ""spark-sql"" % sparkVersion, ""org.apache.spark"" %% ""spark-repl"" % sparkVersion ) } The libraryDependencies line tells sbt to download the specified spark components. In this example, we specify dependencies to spark-core, spark-sql, and spark-repl, but you can add more spark component dependencies. Just follow the same pattern, like: spark-mllib, spark-graphx, and so on. Read detailed documentation on sbt build definition . 5.
From the root directory of your project, run the following command: sbt update . This command uses Apache Ivy to compute all the dependencies and download them in your local machine at /.ivy2/cache directory. 6. Compile your source code by entering the following command: sbt compile 7. Package your compiled code as a jar by entering the following command: sbt package . You should see a file named hellospark 2.10-1.0.jar in your project root directory’s target/scala-2.10 directory. (Terminal tells you where it saved the package.) The namingconvention for the jar file is:projectName scala version-project version .jar hellospark 2.10-1.0 .jarDEPLOY YOUR CUSTOM LIBRARY JAR TO A JUPYTER NOTEBOOKWith your custom library built and packaged, you're ready to deploy it to aJupyter Notebook on Bluemix. 1. If you haven't already, sign up for Bluemix , IBM's open cloud platform for building, running, and managing applications. 2. In Bluemix, initiate the IBM Analytics for Apache Spark service. 1. In the top menu, click Catalog . 2. Under Data and Analytics , find Apache Spark . 3. Click to open it, and click Create . 3. Get the deployable jar on a publicly available url by doing one of the following: * Upload the jar into a github repository. Note the download URL. You'll use in Step 5 to deploy the jar into the IBM Analytics for Apache Spark Service. * Or, you can use our sample jar, which is pre-built and posted here on github . 4. Create a new Scala notebook. 1. In Bluemix, open your Apache Spark service. 2. If prompted, open an existing instance or create a new one. 3. Click New Notebook . 4. Enter a Name , and under Language select Scala . Click Create Notebook . 5. In the first cell, enter and run the following special command called AddJar to upload the jar to the IBM Analytics for Spark service. Insert the URL of your jar. %AddJar https://github.com/ibm-cds-labs/spark.samples/raw/master/dist/hellospark_2.10-1.0.jar -f That % before AddJar is a special command, which is currently available, but may be deprecated in an upcoming release. We'll update this tutorial at that time. The -f forces the download even if the file is already in the cache. Now that you deployed the jar, you can call APIs from within the Notebook.CALL THE HELPER FUNCTIONS FROM A NOTEBOOK CELLIn the notebook, call the code from the helloSpark sample library. In a newcell, enter and run the following code:val countPerPartitions = 500000var partitions = 10val stats = com.ibm.cds.spark.samples.HelloSpark.computeStatsForCollection( sc, countPerPartitions, partitions)println(""Mean: "" + stats._1)println(""Variance: "" + stats._2)Final results in your Bluemix Jupyter Notebook look like this:OPTIONAL: IMPORT, TEST, AND DEBUG YOUR PROJECT IN SCALA IDE FOR ECLIPSEIf you want to get serious and import, test, and debug your project in a localdeployment of Apache® Spark™, follow these steps for working in Eclipse. 1. Download the Scala IDE for Eclipse . (Note that you can alternatively use the Intellij scala IDE but it's easier to follow this tutorial with Scala IDE for Eclipse) 2. Install sbteclipse (sbt plugin for Eclipse) with a simple edit to the plugins.sbt file, located in ~/.sbt/0.13/plugins/ (If you can't find this file, create it.) Read how to install . 3. Configure Scala IDE to run with Scala 2.10.4 Launch Eclipse and, from the menu, choose Scala IDE Preferences . Choose Scala Installations and click the Add button. Navigate to your scala 2.10.4 installation root directory, select the lib directory, and click Open . 
Name your installation (something like 2.10.4 ) and click OK . Click OK to close the dialog box. 4. Generate the eclipse artifacts necessary to import the project into Scala IDE for eclipse. Return to your Terminal or Command Line window. From your project's root directory use the following command: sbt eclipse . Once done, verify that .project and .classpath have been successfully created. 5. Return to Scala IDE, and from the menu, choose File Import . In the dialog that opens, choose General Existing Projects into Workspace . 6. Beside Select root directory , click the Browse button and navigate to the root directory of your project, then click Finish : 7. Configure the scala installation for your project. The project will automatically compile. On the lower right of the screen, on the Problems tab, errors appear, because you need to configure the scala installation for your project. To do so, right-click your project and select Scala Set the Scala installation . In the dialog box that appears, select 2.10.4 (or whatever you named your installation). Click OK and wait until the project recompiles. On the Problems tab, there are no errors this time. 8. Export the dependency libraries. (This will make it easier to create the launch configuration in the next step). Right-click on the helloSpark project, and select Properties . In the Properties dialog box, click Java Build Path . The Order and Export tab opens on the right. Click the Select All button and click OK : 9. Create a launch configuration that will start a spark-shell. 1. From the menu, choose Run Run Configurations . 2. Right-click Scala Application and select New . 3. In Project , browse to your helloSpark project and choose it. 4. In Main Class , type org.apache.spark.deploy.SparkSubmit 5. Click the Arguments tab, go to the Program Arguments box, and type: --class org.apache.spark.repl.Main spark-shell Then within VM Arguments type: -Dscala.usejavacp=true -Xms128m -Xmx800m -XX:MaxPermSize=350m -XX:ReservedCodeCacheSize=64m 10. Click Run . 11. Configuration runs in the Console and completes with a scala prompt. Now you know how to run/debug a spark-shell from within your developmentenvironment that includes your project in the classpath, which can then becalled from the shell interpreter.You can also build a self-contained Apache® Spark™ Application and run it manually using spark-submit or scheduling. The sample code thatcomes with this tutorial is designed to run both as an Apache® Spark™Application and a reusable library. If you want to run/debug the applicationfrom within the Scala IDE, then you can follow the same steps as above, but inStep 9e, replace the call in the Program Arguments box with the fully qualifiedname of your main class, like --class com.ibm.cds.spark.samples.HelloSpark spark-shellSUMMARYYou just learned how to build your own library for Apache® Spark™ and share itvia Notebook on the cloud. You can also manage your project with the import,test, and debug features of Scala IDE for Eclipse.Next, move on to my Sentiment Analysis of Twitter Hashtags tutorial, which uses Apache Spark Streaming in combination with IBM Watson totrack how a conversation is trending on Twitter. In future tutorials, we'll diveinto more sample apps that cover more on Spark SQL, Spark Streaming, and otherpowerful components that Spark has to offer.© “Apache”, “Spark,” and “Apache Spark” are trademarks or registered trademarksof The Apache Software Foundation. 
All other brands and trademarks are theproperty of their respective owners.SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM PrivacyIBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.",Build a custom library for Apache® Spark™ and deploy it to a Jupyter Notebook.,Start Developing with Spark and Notebooks,Live,45 121,"REAL-TIME Q&A APP WITH RETHINKDB Matt Collins / September 13, 2016One of my first tasks here at IBM Cloud Data Services was to blog about my first impressions of RethinkDB . One thing that article briefly touched upon was RethinkDB’s unique ability of being able to push updates to your app as and when the data changes — making it a strong contender to be your database of choice when building real-time apps. This was something that got my attention back then, and it’s about time we revisited RethinkDB’s push functionality to see just how easy it is to build a real-time app. THE CHALLENGE As a Developer Advocate, a large part of my job is going out into the community and delivering talks on a range of topics. These talks invariably end with a Q&A session where I can clear up anything that was confusing or misunderstood. Live Q&As seem like a good use-case for building a real-time app: allow attendees to ask questions from their smartphone during the talk, and vote on which questions they want answered. We can then update the list of questions on-screen in real time, showing the most popular questions and any answers that surfaced during the talk. The live Q&A app we’ll build using Node.js and RethinkDB. TOOLS Node.js is what I am going to use to build this app. Node’s ability to deal with a large amount of concurrent connections is something that stands it in good stead when building real-time apps. Although it’s not something we need to consider in this post, it’s good to design for future scale. RethinkDB is our database. As mentioned above, we are going to be making use of the changefeeds functionality to push any changes to our app as and when they happen. We will also be looking at ReQL and how that works with Node.js. You can start up a free RethinkDB instance with Compose to get you going. For the front end we will be using Vue.js and Bootstrap . Vue is one of many Javascript Frameworks for building front ends. I prefer it to something like Angular, as I think Vue is easier to get up and running. Bootstrap, of course, is the popular HTML/CSS framework from Twitter. We still need to be able to get this data from our app to the front end, and to do that we will be using Socket.IO . This is a simple way to implement WebSockets in your Node.js app, but it will also take care of any cross-browser/platform issues for you. We will also use this Socket.IO component for Vue, so that we can easily incorporate Socket.IO into our Vue app. SET UP All the code below can be found in the rethinkdb-questions GitHub repository for this article. Clone the repo and npm install to get all of the dependencies and follow along below! app.js is the brains of our whole app. We are using Express to help us get up and running quickly. 
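To give a feel for how that file opens, here is a hedged sketch of its first few lines — the exact layout in the repository may differ, and the Compose host, port and authKey shown are placeholders rather than real values:
// dependencies: Express for routing and the official RethinkDB driver
var express = require('express');
var r = require('rethinkdb');
var app = express();

// connection object for a local RethinkDB instance...
var connection = { host: 'localhost', port: 28015 };

// ...or for a hosted Compose deployment (placeholders only)
// var connection = { host: 'aws-us-east-1-portal.0.dblayer.com', port: 10000, authKey: 'YOUR_AUTH_KEY' };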
The first portion of the file is just including dependencies and so on. Once we get to line 25 or so, we have some configuration to do. We need to define a connection object for RethinkDB. There are two connection objects defined in the code at the moment: one if you are using a local RethinkDB instance, and one if you are using a hosted instance via Compose. Uncomment the one you wish to use, and if you are using Compose, make sure you enter your connection details! On the topic of Compose connection details for RethinkDB, note that Compose's Deployment Overview gives you a proxy connection string that is similar to — yet subtly different from — the URL for the RethinkDB admin UI. It's the difference of a single . , so make sure you're using the right string for your host. We will use this connection object every time we create a RethinkDB connection. RETHINKDB AND REQL Before we dive into the code we should take a bit of time to talk about ReQL, the query language for RethinkDB. When you create a connection to RethinkDB, this is a permanent, socket-like connection that will stay open until it is closed by the application. This is useful for a couple of reasons: * It allows RethinkDB to return a cursor instead of a dataset to allow us to iterate through the data in an efficient manner * We can push updates down this connection using changefeeds We get this cursor by querying the database using ReQL, which is RethinkDB's native query language. It is designed to embed itself into your code — i.e., if you're building your app in JavaScript, your ReQL looks like JavaScript — so that it feels familiar and comfortable to the developer. It also fits into the standard coding patterns of whichever language you are using. It is important to know that even though the query looks like JavaScript (or Ruby, Python, etc.), none of the heavy lifting is being performed in JavaScript, or even by your app. What is happening in the background is that your ReQL expression is being compiled down into a query that RethinkDB understands, sent to the server, and then executed in a distributed fashion across the whole cluster — allowing for performant querying of large datasets. That being said, let's look at how we can use ReQL in our app. API ENDPOINTS We will start with the API, or the back end of the app. The API is going to provide the front end with the ability to get our questions data from the database, as well as add new questions and update existing questions. The API consists of a collection of Express routes that we will define, which will in turn query our RethinkDB instance. We won't cover how Express routing works today, however there is a very simple guide on the Express website that should help if you're unfamiliar. ADDING NEW QUESTIONS Having a real-time Q&A app is no good if we have no questions, so the first thing we need to do is create a way to add some. This is done using the POST /question endpoint. // Create a new question app.post(""/question"", bpJSON, bpUrlencoded, (req, res) => { var question = { question: req.body.question, score: 1, answer: """" } // connect, run the insert, then report success and close the connection r.connect(connection, function(err, conn) { r.table(""questions"").insert(question).run(conn, (err, cursor) => { res.json({ success: !err }) conn.close() }) }) }) We create our question object using the question parameter provided as part of the request. Then we connect to the database, build up our query in ReQL, and run the query.
The ReQL portion of this code is here: r.table(""questions"").insert(question) Let's take a minute to examine what this query is doing: * r is the RethinkDB namespace * We can tag on the table(""questions"") method (just like JavaScript, remember) to select our desired table * And then, we can use the insert(question) method to say we wish to insert a new document, passing in our question object. Simple, huh? Once the query completes, we return a simple JSON response, just to signify whether this request was successful or not, and close our connection to the database. It's important to close your RethinkDB connection, as you don't want unused, open connections consuming resources. GETTING QUESTION DATA Now that we have some questions, we probably want to be able to get them back, right? The GET /questions endpoint is designed to do just that – return all existing questions from the RethinkDB database in one go. // Get all questions app.get(""/questions"", (req, res) => { r.connect(connection, (err, conn) => { r.table(""questions"").run(conn, (err, cursor) => { // turn the cursor into an array, send it to the client, and close the connection cursor.toArray((err, results) => { res.json(results) conn.close() }) }) }) }) All we are doing here is connecting to RethinkDB, using ReQL to ask for everything from the questions table, and transforming the full dataset into an array which is then sent back to the client in the response. In the meantime, we close our connection to RethinkDB. Again, the ReQL portion of this code is here: // equivalent to SELECT * FROM questions r.table(""questions"") We previously touched upon the fact that we don't get a plain dataset back; instead, we receive a cursor. In this instance, we don't want a cursor. So in the Get all questions snippet we use the toArray() method to return our full dataset. UPDATING OUR QUESTIONS We mentioned before that we wanted our users to be able to vote on questions. We actually have two endpoints for voting: POST /upvote/:id and POST /downvote/:id , which will either add or subtract 1 from the score of a question. Here's what the ReQL looks like: // get by ID // update score to score+1 // default of 1 if no score set r.table(""questions"").get(req.params.id).update({ score: r.row(""score"").add(1).default(1) }) In English, we are saying: * Get a question by a specified ID * Update this question * Set the value of score to be score+1 * If score does not currently exist, then set it to 1. This is how we are upvoting a question. Similarly, we use .sub(1) in the downvote endpoint. Finally, a question is no use without an answer. We add answers using POST /answer/:id . This process is similar to changing the score of a question. All we need to do is find our question by its unique ID and update it to include an answer that is provided via this request: // get by ID // update answer r.table(""questions"").get(req.params.id).update({ answer: req.body.answer }) FRONT END Next, we need to define some routes for Express to allow us to access our app via a web front end. * GET / is the homepage and will be used to display our questions & answers * GET /answer is identical to the homepage, but will allow an administrator to answer questions Both of these endpoints will return index.html , which is where we will create our front end. Once jQuery has told us that our document is ready to go, we define our Vue app with the app variable. This is where we can define our data model on the client side, along with a bunch of methods we can call and handlers for our Socket.IO events.
app = new Vue({ el: '#app', // the HTML element that this Vue app relates to data: { questions: [] // our data model, an array of questions }, methods: { ... // a bunch of methods we have defined that can interact with our data model }, computed: { ... // some computed values that we can use }, sockets:{ ... // socket event handlers } }) After defining our app, we call the app.getQuestions() method, which will hit the GET /questions endpoint, retrieving all of our questions data. We can then store this data in our data model at app.questions . It is easy to define how we want our data to be displayed with Vue. The app that we defined above relates to everything that is defined inside div#app .
[div#app markup stripped during extraction: a button that toggles the ""ask a question"" form, the form itself, and a container div that houses the questions]
In here we have a button, which toggles a form that we can use to ask a new question. Below the form, we have another div that will house all of our questions. We can do some powerful things like iterate through our questions array to render a new div for each question. The example below is saying “iterate through the sortedQuestions computed array, and create a new div for each element, exposing this element as question “.
[Question list markup stripped during extraction: a div rendered for each element of sortedQuestions, exposing the current element as question]
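For reference, sortedQuestions is one of the computed values and getQuestions one of the methods that were collapsed in the Vue skeleton earlier. The real implementations live in the front-end scripts of the repository; the following is only a sketch, assuming questions are ordered by score and fetched with jQuery:
methods: {
    // ask the GET /questions endpoint for everything and store it in the data model
    getQuestions: function() {
        var self = this;
        $.getJSON('/questions', function(data) {
            self.questions = data;
        });
    }
},
computed: {
    // highest-scoring questions first (the ordering here is an assumption)
    sortedQuestions: function() {
        return this.questions.slice().sort(function(a, b) {
            return b.score - a.score;
        });
    }
}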
We can then use the special Vue handlebars notation to define the ID of this div like so id=""{{ question.id }}"" . At any point in this div we can refer to the question variable within the handlebars notation to refer to the current question in the array. This gives us access to other properties such as score , answer , and question to help us create our question template in HTML. We can also call the methods we defined in our app by binding them to events on our elements — for example, the upvote button binds a click handler: when the button is clicked, we call the upvote() method that is defined in our Vue app, passing the unique ID of this question as the only parameter. REAL-TIME UPDATES We have a number of methods defined in general.js and questions.js on the front end that make requests back to the API endpoints we discussed earlier in the article. We pass these requests using the jQuery ajax APIs. * getQuestions() calls GET /questions * askQuestion() handles the submission of the question form to POST /question * doUpvote() calls POST /upvote/:id * doDownvote() calls POST /downvote/:id * answerQuestion() handles the submission of the answer form to POST /answer/:id On closer inspection of these functions, you might notice that when we add or update a question, there is no code to react to the response from the API and update the front end. The reason is that we want to handle any updates to the data in real time, and we want all of our clients to respond to the same stimulus — i.e., an event from our app to tell us that the data has changed in the database. This approach helps us to manage state within our real-time app. How else could it work? Well, if we have multiple ways of updating the front end — i.e., after making an API call, or after receiving a WebSocket event — then there is a greater chance of introducing bugs or inconsistencies in our code, meaning that our clients could get out of sync. If every client is reacting to data changes from a single source, then there is a much greater chance of consistent results across the whole user base. Historically, computer systems have been built with a single source of data (a database!), but this doesn't necessarily work well with data-driven, real-time apps. Traditional databases are not designed for this use-case. To get such a database to work in this context, the developer has to use an antiquated approach such as polling (repeatedly asking the database for an update). Developers could also use message queues and additional infrastructure to help manage the flow of data. Neither of these solutions, however, scales well. Scaling data-driven web apps is the problem we're trying to solve by using RethinkDB and changefeeds. Speaking of changefeeds… CHANGEFEEDS & SOCKET.IO As mentioned previously, we'll use the RethinkDB changefeed feature to get updates from our database whenever a new question is added or updated. Towards the bottom of app.js you should see some code that creates our changefeed. r.connect(connection, (err, conn) => { r.table(""questions"").changes().run(conn, (err, cursor) => { // for each update emit the data via Socket.IO cursor.each((err, item) => { // new if (item.old_val === null && item.new_val !== null) { io.emit('new', item.new_val) } // deleted else if (item.old_val !== null && item.new_val === null) { io.emit('deleted', item.old_val) } // updated else if (item.old_val !== null && item.new_val !== null) { io.emit('updated', item.new_val) } }) }) }) The first thing to note here is that we are not closing the connection to RethinkDB — this is because we want to keep the connection open so that we can continue receiving updates.
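A note on the io object used in those emit() calls: it is the Socket.IO server instance created in app.js. The exact wiring in the repository isn't reproduced here, but it is assumed to follow the standard Socket.IO-with-Express pattern, roughly:
// attach Socket.IO to the same HTTP server that serves the Express app
var http = require('http').createServer(app);
var io = require('socket.io')(http);

// port is illustrative
http.listen(3000, function() {
    console.log('listening on *:3000');
});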
Using changefeeds is similar to doing a normal query in ReQL: you just attach the .changes() method to the end of your query. We still receive a cursor, and we can use that cursor to iterate over any incoming update events. Update events look like this: { new_val: { ... }, old_val: { ... } } The new_val is what is currently being stored in the database, whilst the old_val is what was previously there. If the old_val is null , then that indicates there was no previous value and this event represents an insert of a new document. If new_val is null , then that indicates a deletion. If both new_val and old_val have data, then this signifies an update has taken place. In the code example above, you can see that we are determining which event has occurred, and that we are using Socket.IO to emit() the relevant data to the front end via WebSockets. When emitting events via Socket.IO, the first parameter is the name of the event ( new , deleted , updated ) and the second parameter is the data you wish to send. Again, we will not examine Socket.IO here, but there is an easy-to-follow article on the official website that shows how to get started with Socket.IO and Express . In our front end HTML, we have included the Socket.IO client library and a Socket.IO component for Vue: We then configure Vue to use Socket.IO and define the location of the server: // Tell Vue to use Socket.io var socketUrl = `${location.protocol}//${location.hostname}${(location.port ? ':'+location.port: '')}`; Vue.use(VueSocketio, socketUrl); We can then define event handlers for Socket.IO that will listen for events. We have three events: * new – a question has been inserted into the database * updated – a question has been updated (either answered or voted on) * deleted – a question has been deleted from the database The handlers are defined in the sockets object of the Vue app we created at the beginning of the article. The handlers are simply functions that update the data model of our app to reflect the change in our questions data. Because of Vue’s data bindings, we don’t have to do anything else — the front end will automatically update to reflect the changed data! Now, whenever the data changes within the database these updates will be reflected, in real time, within our app. Pretty cool, huh? CONCLUSION So there we have it: from nothing to a real-time Q&A app in relatively little time! What have we learnt? Simply put, RethinkDB is a great place to start building your real-time apps. It provides a single source of data to power your apps that isn’t available from other database offerings without resorting to clumsy polling or complicated architecture to manage the flow of data within your app. RethinkDB doesn’t give you everything you need to create a real-time app front-to-back. If you want to update users in real time, you still need a way to get your updates to the front end, but RethinkDB does a lot of the heavy lifting for you in a clear and succinct way. SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Click to share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Tagged: javascript / Node.js / nodejs / RethinkDB Please enable JavaScript to view the comments powered by Disqus. 
blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Geospatial * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM Privacy IBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.",RethinkDB's push updates makes it great for real-time apps. Here's an example built for live Q&A sessions at conferences using RethinkDB changefeeds.,Q&A Voting App with RethinkDB,Live,46 123,"Skip to main content IBM developerWorks / Developer Centers Sign In | Register Cloud Data Services * Services * How-Tos * Blog * Events * Connect CONTENTS * Apache Spark * Get Started * Get Started in Bluemix * Tutorials * Load dashDB Data with Apache Spark * Load Cloudant Data in Apache Spark Using a Python Notebook * Load Cloudant Data in Apache Spark Using a Scala Notebook * Build SQL Queries * Use the Machine Learning Library * Build a Custom Library for Apache Spark * Sentiment Analysis of Twitter Hashtags * Use Spark Streaming * Launch a Spark job using spark-submit * Sample Notebooks * Sample Python Notebook: Precipitation Analysis * Sample Python Notebook: NY Motor Vehicle Accidents Analysis * BigInsights * Get Started * BigInsights on Cloud for Analysts * BigInsights on Cloud for Data Scientists * Perform Text Analytics on Financial Data * Sample Scripts * Compose * Get Started * Create a Deployment * Add a Database and Documents * Back Up and Restore a Deployment * Enable Two-Factor Authentication * Add Users * Enable Add-Ons for Your Deployment * Compose Enterprise * Get Started * Cloudant * Get started * Copy a sample database * Create a database * Change database permissions * Connect to Bluemix * Developing against Cloudant * Intro to the HTTP API * Execute common API commands * Set up pre-authenticated cURL * Database Replication * Use cases for replication * Create a replication job * Check replication status * Set up replication with cURL * Indexes and Queries * Use the primary index * MapReduce and the secondary index * Build and query a search index * Use Cloudant Query * Cloudant Geospatial * Integrate * Create a Data Warehouse from Cloudant Data * Store Tweets Using Cloudant, dashDB, and Node-RED * Load Cloudant Data in Apache Spark Using a Scala Notebook * Load Cloudant Data in Apache Spark Using a Python Notebook * dashDB * dashDB Quick Start * Get * Get started with dashDB on Bluemix * Load data from the desktop into dashDB * Load from Desktop Supercharged with IBM Aspera * Load data from the Cloud into dashDB * Move data to the Cloud with dashDB’s MoveToCloud script * Load Twitter data into dashDB * Load XML data into dashDB * Store Tweets Using Bluemix, Node-RED, Cloudant, and dashDB * Load JSON Data from Cloudant into dashDB * Integrate dashDB and Informatica Cloud * Load geospatial data into dashDB to analyze in Esri ArcGIS * Bring Your Oracle and Netezza Apps to dashDB with Database Conversion Workbench (DCW) * Install IBM Database Conversion Workbench * Convert data from Oracle to dashDB * Convert IBM Puredata System for Analytics to dashDB * From Netezza to dashDB: 
It’s That Easy! * Use Aginity Workbench for IBM dashDB * Build * Create Tables in dashDB * Connect apps to dashDB * Analyze * Use dashDB with Watson Analytics * Perform Predictive Analytics and SQL Pushdown * Use dashDB with Spark * Use dashDB with Pyspark and Pandas * Use dashDB with R * Publish apps that use R analysis with Shiny and dashDB * Perform market basket analysis using dashDB and R * Connect R Commander and dashDB * Use dashDB with IBM Embeddable Reporting Service * Use dashDB with Tableau * Leverage dashDB in Cognos Business Intelligence * Integrate dashDB with Excel * Extract and export dashDB data to a CSV file * Analyze With SPSS Statistics and dashDB * REST API * Load delimited data using the REST API and cURL * DataWorks * Get Started * Connect to Data in IBM DataWorks * Load Data for Analytics in IBM DataWorks * Blend Data from Multiple Sources in IBM DataWorks * Shape Raw Data in IBM DataWorks * DataWorks API INSTALL IBM DATABASE CONVERSION WORKBENCH Jess Mantaro / July 22, 2015See how to download and install Database Conversion Workbench, Data Studio plugin for IBM dashDB. You can also read a transcript of this video RELATED LINKS * About IBM Data Conversion Workbench * Convert IBM PureData for Analytics to dashDB * Convert data from Oracle to dashDB Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM Privacy IBM","Watch how to download and install Database Conversion Workbench, Data Studio plugin for IBM dashDB.",Install IBM Database Conversion Workbench,Live,47 126,"Data Science Experience Datasci X * Data Science Experience Datasci X * Data Works Sign In Sign UpDOCUMENTATION * All * Get started * Analyze data * Manage data * Get started * Quick overview * Set up projects and collaborate * Known issues * FAQs * Analyze data * Notebooks * Create notebooks: overview * Sample notebooks * Parts of a notebook * Install libraries and packages * Pixiedust packageManager * Load and access data in a notebook * Visualizations * Model visualizations * Pixiedust visualizations * Brunel visualizations * RStudio * Spark overview * Manage data * Catalogs * Create a catalog * Create data assets * Create catalog projects * Add and manage collaborators * Monitor data usage and user activity * Analyze streaming data from Kafka topics Get started with IBM Data Science ExperienceGET STARTED WITH IBM DATA SCIENCE EXPERIENCE Welcome to IBM Data Science Experience (DSX). Depending on the plan you chose, your environment is set up with one or more Apache Spark instance and 5 GB or more of object storage. PROJECTS AND NOTEBOOKS If you want to jump right in, you can create projects to collaborate with other data scientists and data engineers, create and share notebooks, data sets, and data connections, or use RStudio. * To start setting up projects and collaborating, see Set up projects and collaborate . * To work with notebooks in your projects, see Create notebooks: overview . * For RStudio, see RStudio overview . COMMUNITY You can also explore the community area for curated data sets, sample notebooks, articles, and tutorials, both to learn from and to use as starting points. Figure: A sample of community cards Whenever you want to return to your homepage, click the IBM Data Science Experience button. 
Learn more: * Quick overview * Known issues * FAQs * Contact * Privacy * Terms of Use",Learn to use IBM Data Science Experience.,Data Science Experience Documentation,Live,48 127,"Compose The Compose logo Articles Sign in Free 30-day trialGEOFILE: USING OPENSTREETMAP DATA IN COMPOSE POSTGRESQL - PART II Published Mar 30, 2017 geofile openstreetmap postgis GeoFile: Using OpenStreetMap Data in Compose PostgreSQL - Part IIGeoFile is a series dedicated to looking at geographical data, its features, and uses. In today's article, we're continuing our examination of OpenStreetMap data and walking through how to incorporate other data sources. We'll also look at using PostGIS to filter our data and to find places that are within or intersect a chosen polygon. In the last GeoFile article , we looked at how to import OpenStreetMap (OSM) data into Compose PostgreSQL and ran some queries to get the most popular cuisines in Seattle. We found that coffee shops were the most popular places in the city, and we provided a top ten list of which coffee companies have the most branches in Seattle. In this article, we'll be using the same OSM data in conjunction with the Seattle Police Department's 911 call data . We'll show you how to create tables and store this data in PostgreSQL using Sequelize , a Node.js ORM for relational databases. Then we'll look at locations, areas, and reasons for 911 calls using PostGIS and then viewing them all using OpenJUMP , an open-source GIS tool. Let's look at Sequelize and import some data into our PostgreSQL deployment ... SEQUELIZE ME Sequelize is a Node.js ORM that works with a number of relational databases out of the box. For our use case, Sequelize makes it easy to perform CRUD operations and create models for our data. In particular, for the data model that we'll be creating, it comes with a geometry data type that works well with GeoJSON and PostgreSQL. What we'll be doing with Sequelize is creating a table called emergency_calls and inserting GeoJSON documents from the SPD 911 call API. The information that we'll be gathering from the API is the incident id , longitude , latitude , event_clearance_group , and event_description . The event_clearance_group and event_description provide us with details about each 911 call incident. In addition to Sequelize, we'll be using the request Node.js library. This library will allow us to gather the GeoJSON documents from the SPD 911 call API and will help us insert the documents into PostgreSQL one at a time. To install Sequelize and request , we'll write the following in our terminal using NPM. npm install sequelize request --save After installing the packages, let's create a file called 911data.js . Within the file, we'll first require both the request and sequelize libraries we installed with NPM. const request = require('request'); const Sequelize = require('sequelize'); We then set up a variable url with the URL of the API and append to the URL $$app_token and include a custom token from data.seattle.gov . You'll have to apply for a token in order to not have download limits on your data. Next, we'll use Sequelize's $offset and $limit functions to limit the number of records we'll import to our database since there are more than 1.3 million records in the SPD 911 calls dataset. In order to get the latest 911 calls, we'll offset our data by 1.3 million rows and limit our data to only the last 100,000 rows. 
const url = ""https://data.seattle.gov/resource/pu5n-trf4.geojson?$$app_token=your_token&$offset=1300000 After that, we'll initialize a database connection using our Compose PostgreSQL connection string located on the Overview page under Credentials . At the end of the connection string, we'll change the database name from compose to osm since we're inserting the 911 call records into a table located within the OSM database. const sequelize = new Sequelize(""postgres://admin:mypass@aws-us-west-4-portal.0.dblayer.com:25223/osm""); Once that's done, we can set up a Sequelize model for our data. For this example, we'll keep it simple and only get the id and the event_group and event_description , which categorize and describe each 911 call. We'll also set up a column called geom that will automatically process our GeoJSON longitude and latitude coordinates into a PostGIS geometry object. Sequelize does this by using the PostGIS function ST_GeomFromGeoJSON behind the scenes. Since we're using GeoJSON data, the PostGIS geometry object coordinate reference system will be set to SRID 4236, but the geom column set up by Sequelize will have an SRID set to 0. We'll have to change this once we've inserted our data since SRID 4236 will not work with OSM data since it uses a different SRID - this is discussed further below. To set up a model for our data, we'll first define the model using Sequelize's define method. The first argument that the method takes is the table name we want to create. For our use case, the table will be named emergency_calls . The second argument of the function is an object that contains the data type, field name, and other constraints we want to put on our columns such as defining primary keys and allowing null values. const EmergencyCalls = sequelize.define('emergency_calls', { id: { type: Sequelize.STRING, field: 'id', primaryKey: true }, eventGroup: { type: Sequelize.TEXT, field: 'event_clearance_group' }, eventDescription: { type: Sequelize.TEXT, field: 'description' }, geom: { type: Sequelize.GEOMETRY('POINT'), field: 'geom', allowNull: false } }); The field is the name we assign to the PostgreSQL table column. The type is the data type we assign to the column using any appropriate Sequelize's data type . For columns what will store data as a string, we'll use Sequelize's STRING data type. For the geom column, Sequelize has a GEOMETRY data type that allows us to assign the type of geometry to a column. In our use case, it's ""POINT"" since the GeoJSON geometry type is also ""POINT"". Sequelize also allows us to assign the GEOMETRY data type a second parameter that is a SRID number. However, since our GeoJSON data does not contain information regarding the SRID, if we define the SRID of the column and our data doesn't match when inserting a record, we'll receive an error. Therefore, to solve this problem we will not set the SRID of the column initially and we'll go back and manually change the column later using PostGIS. Now that we have the model set up, we can initialize EmergencyCalls which will create the table. Here, we'll append the sync method to create the PostgreSQL table in the database and use the force option to drop the table if it exists. EmergencyCalls .sync({ force: true }) .then(() => {}) .catch(err = Once we have the model set up and the table created, we can start importing the data into the table. We'll do this using the request library which takes a URL and a callback that contains the GeoJSON data in the body . 
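Several of the snippets in this walkthrough arrive truncated in the extracted text: the url string above stops mid-query, the .sync() chain breaks off at ".catch(err =", and the request callbacks further down are cut at "(err, res, body) = i". The following is a minimal sketch of how those pieces plausibly fit together; the $limit=100000 parameter, the error handling, and the callback bodies are assumptions inferred from the surrounding prose, not code recovered from the original article.

const request = require('request');
const Sequelize = require('sequelize');

// Assumed completion of the truncated URL: the prose says to offset by 1.3 million
// rows and keep only the last 100,000, which suggests a $limit parameter.
const url = "https://data.seattle.gov/resource/pu5n-trf4.geojson" +
            "?$$app_token=your_token&$offset=1300000&$limit=100000";

const sequelize = new Sequelize("postgres://admin:mypass@aws-us-west-4-portal.0.dblayer.com:25223/osm");

// Same model as defined above.
const EmergencyCalls = sequelize.define('emergency_calls', {
  id: { type: Sequelize.STRING, field: 'id', primaryKey: true },
  eventGroup: { type: Sequelize.TEXT, field: 'event_clearance_group' },
  eventDescription: { type: Sequelize.TEXT, field: 'description' },
  geom: { type: Sequelize.GEOMETRY('POINT'), field: 'geom', allowNull: false }
});

EmergencyCalls
  .sync({ force: true })                  // drop and recreate the table
  .then(() => {
    // Fetch the GeoJSON feed and insert one row per feature.
    request(url, (err, res, body) => {
      if (err) return console.error(err);
      const data = JSON.parse(body);
      const jsonFeatures = data.features;
      for (let i = 0; i < jsonFeatures.length; i++) {
        EmergencyCalls.create({
          id: jsonFeatures[i].properties.cad_cdw_id,
          eventGroup: jsonFeatures[i].properties.event_clearance_group,
          eventDescription: jsonFeatures[i].properties.event_clearance_description,
          geom: jsonFeatures[i].geometry
        });
      }
    });
  })
  .catch(err => console.error(err));      // assumed error handler; the original is cut off here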
We'll assign a variable called data that parses the GeoJSON data using the JSON.parse method. Then we'll take the data and iterate over the GeoJSON features array: request(url, (err, res, body) = i Within the for-loop, we'll use the Sequelize create method to insert each 911 call record into our database. We'll only select the necessary information from the GeoJSON ""Properties"" and ""Geometry"" objects and put the results into keys we created from our Sequelize model. EmergencyCalls.create({ id: jsonFeatures[i].properties.cad_cdw_id, eventGroup: jsonFeatures[i].properties.event_clearance_group, eventDescription: jsonFeatures[i].properties.event_clearance_description, geom: jsonFeatures[i].geometry }); The full request looks like: request(url, (err, res, body) = i Once we have the code set up, just run node 911data.js and we'll see the table set up and all of our data being logged in the terminal window. After the data has been inserted, let's see what it looks like in PostgreSQL by logging into our OSM database. OUR 911 DATA AND OSM WITH POSTGIS Once we've logged into our PostgreSQL deployment and connected to our OSM database, we can view what Sequelize has inserted into our emergency_calls table. Using a SELECT query our table will contain documents that look something like this: SELECT * FROM emergency_calls LIMIT 1; Running the query gives us: id | event_clearance_group | description | geom | createdAt | updatedAt -------+--------------------------+--------------------+--------------------------------------------+----------------------------+---------------------------- 89778 | SUSPICIOUS CIRCUMSTANCES | SUSPICIOUS VEHICLE | 0101000000C58EC6A17E935EC040852348A5CC4740 | 2017-03-29 20:48:10.812+00 | 2017-03-29 20:48:10.812+00 Notice that the fields that we defined and the data have been created and inserted along with two other timestamp fields created by Sequelize. These timestamps show when a record has been inserted and updated in the table. If you don't want timestamps to be added, just add timestamps: false inside the Sequelize model. To view the data type of each column run \d emergency_calls . Table ""public.emergency_calls"" Column | Type | Modifiers -----------------------+--------------------------+----------- id | character varying(255) | not null event_clearance_group | text | description | text | geom | geometry(Point) | not null createdAt | timestamp with time zone | not null updatedAt | timestamp with time zone | not null Indexes: ""emergency_calls_pkey"" PRIMARY KEY, btree (id) Here we can see that the geom column has been assigned a geometry data type without an SRID even though there are geometry objects inserted in the column. Since our geom column contains GeoJSON data in the form of a geometry object, the coordinates of the data are automatically calculated using SRID 4326 even though the column doesn't have an SRID defined. If we decided to project the geom data onto our OSM map, however, the geom points would not align with the map because OSM uses SRID 3857. So how do we solve this issue? To solve the issue we'll have to transform our geom column to SRID 3857. To do this, we'll update the column by first setting the SRID of each value to SRID 4326 by using PostGIS's ST_SetSRID function. We'll then transform each value to SRID 3857 using PostGIS's ST_Transform function. 
This can be done by writing the following SQL statement: UPDATE emergency_calls SET geom = ST_Transform(ST_SetSRID(geom,4326),3857); After that's completed, we'll create an index on the geom column: CREATE INDEX idx_emergency_calls ON emergency_calls USING GIST (geom); Using OpenJUMP, we now can view each of the points on the Seattle OSM map. Now that we have some 911 call points, an OSM map, and other OSM data, let's do some querying ... POSTGIS QUERIES In the last article, we looked at some restaurant data using OSM's hstore data column to find the top ten cuisines in Seattle. We then found out that coffee was the most popular ""cuisine"" and found the top ten coffee shops in the city. Let's take a closer look at the map this time by focusing on one particular area of Seattle called Capitol Hill. We'll start out by selecting the area and getting its coordinates. A useful tool to draw and get the coordinates of a polygon on a map is geojson.io . All we have to do is zoom on Seattle and draw a polygon around the area we want. It will automatically provide us with the coordinates of the polygon in GeoJSON in the right sidebar. Once we have the coordinates, we can use PostGIS's function ST_GeomFromText to define the shape type, its coordinates, and the SRID. In this case, since the coordinates derive from a GeoJSON object, the default SRID is 4326. This is what the function with our coordinates will look like: ST_GeomFromText('POLYGON((-122.32576847076416 47.61471249278791, -122.32044696807861 47.61471249278791, -122.32044696807861 47.6179525278143, -122.32576847076416 47.6179525278143, -122.32576847076416 47.61471249278791))', 4326) The value returned from ST_GeomFromText will have to be transformed so that it can be viewed correctly on the OSM map like the geom column coordinates in our emergency_calls table. To do that, we'll again use the ST_Transform function and assign it SRID 3857: ST_Transform(ST_GeomFromText(..., 4326), 3857) Once we have this setup, we can use OpenJUMP and select Run Datastore Query from the File menu. Once we select Run Datastore Query , it will open a window to create a new map layer. We first select or create a new connection to our OSM database. Then we type in the name of our layer and then write the SQL query: SELECT ST_Transform(ST_GeomFromText('POLYGON((-122.32576847076416 47.61471249278791, -122.32044696807861 47.61471249278791, -122.32044696807861 47.6179525278143, -122.32576847076416 47.6179525278143, -122.32576847076416 47.61471249278791))', 4326), 3857) FROM planet_osm_polygon; Running this query we will see the polygon appear on the map. To get all the restaurants and their names that are only within the polygon, we'll use the PostGIS function ST_Contains , which selects only objects that are contained within a defined geometry. A simple way to understand how this function works is to view it as: ST_Contains(shape_to_search_in, objects_to_find_in_the_shape) Therefore, when we write this query to find all the OSM points inside the polygon, we'd write: SELECT amenity, name, way FROM planet_osm_point WHERE ST_Contains(ST_Transform(ST_GeomFromText('POLYGON((-122.32576847076416 47.61471249278791, -122.32044696807861 47.61471249278791, -122.32044696807861 47.6179525278143, -122.32576847076416 47.6179525278143, -122.32576847076416 47.61471249278791))', 4326), 3857), way) AND amenity = 'restaurant' GROUP BY amenity, name, way; Within ST_Contains , the first geometry we add is the polygon that we want to search within. 
In this case, it's the same polygon that we created and transformed to SRID 3857. The second geometry way is the geometry column of planet_osm_point which are the OSM points we want to return if they are contained inside the polygon. Additionally, we select only the restaurant amenities so that we only get the restaurants and filter out the other points. Running this query provides us with five results: amenity | name | way ------------+-----------------------+---------------------------------------------------- restaurant | 611 Supreme | 0101000020110F000014AE4719F1F869C185EB51B86C0D5741 restaurant | Bill's Off Broadway | 0101000020110F00008FC2F5E0DBF869C1C3F528EC6B0D5741 restaurant | Fogan Cocina Mexicana | 0101000020110F00003333339BF6F869C114AE47D1760D5741 restaurant | Raygun Lounge | 0101000020110F000048E17A7402F969C11F85EB716B0D5741 restaurant | Yo! Zushi | 0101000020110F00006666660EC3F869C1295C8FD2870D5741 Using our 911 call data, we could create a more complex example using ST_Contains to show the number of 911 calls that took place near these restaurants. To so that, what we'd want to show is the name of each restaurant, the reason for the 911 call, and the distance between the 911 call event and the restaurant. Another constraint that we might add is that the 911 call has to be within a radius of 30 meters from the restaurant. This will filter out calls that are further away. This query would look similar to the following: SELECT p.name AS restaurant, e.event_clearance_group AS activity, ST_Distance(p.way, e.geom) AS distance FROM planet_osm_point AS p, emergency_calls AS e WHERE p.amenity = 'restaurant' AND e.event_clearance_group Running the query gives us 71 results like: restaurant | activity | distance -----------------------+--------------------------+------------------ 611 Supreme | ASSAULTS | 21.6197191748894 ... Fogan Cocina Mexicana | MENTAL HEALTH | 29.6527188363329 ... Raygun Lounge | SUSPICIOUS CIRCUMSTANCES | 28.8634577346799 ... Yo! Zushi | FALSE ALARMS | 27.4377722361284 The first thing to notice is that we have two ST_Contains functions being used. Each one indicates that the restaurant and the 911 call should be contained within the polygon. What's also noticeable is the other PostGIS queries that we added: ST_Distance and ST_DWithin . ST_Distance provides the distance between one geometry and the other. In this case, it shows us the distance between the 911 call and a restaurant. The function ST_DWithin returns true if two geometries are within a specified distance of each other. So, above, we are indicating in the WHERE clause that each restaurant and 911 call have to be within 30 meters of each other. Another interesting function provided by PostGIS is ST_Instersects , which is useful for when you want to see what shapes intersect with another. This function is helpful especially when we want to find roads that intersect with our polygon. Like the function ST_Contains , this function first takes the geometry of the polygon that we are searching within, and then the geometry column of the table that contains the shapes we want if they intersect with our polygon. 
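Note that the distance query above is cut off in the extracted text right after "AND e.event_clearance_group". Based on the explanation that follows it (two ST_Contains checks against the same Capitol Hill polygon plus a 30-meter ST_DWithin constraint), it plausibly reads something like the sketch below; the IS NOT NULL condition and the exact ordering of the clauses are assumptions.

SELECT p.name AS restaurant,
       e.event_clearance_group AS activity,
       ST_Distance(p.way, e.geom) AS distance
FROM planet_osm_point AS p, emergency_calls AS e
WHERE p.amenity = 'restaurant'
  AND e.event_clearance_group IS NOT NULL  -- assumed; the original clause is truncated here
  AND ST_Contains(ST_Transform(ST_GeomFromText('POLYGON((-122.32576847076416 47.61471249278791, -122.32044696807861 47.61471249278791, -122.32044696807861 47.6179525278143, -122.32576847076416 47.6179525278143, -122.32576847076416 47.61471249278791))', 4326), 3857), p.way)
  AND ST_Contains(ST_Transform(ST_GeomFromText('POLYGON((-122.32576847076416 47.61471249278791, -122.32044696807861 47.61471249278791, -122.32044696807861 47.6179525278143, -122.32576847076416 47.6179525278143, -122.32576847076416 47.61471249278791))', 4326), 3857), e.geom)
  AND ST_DWithin(p.way, e.geom, 30);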
A query selecting all the roads that interest with our polygon would look like the following: SELECT name, way FROM planet_osm_roads WHERE ST_Intersects(ST_Transform(ST_GeomFromText('POLYGON((-122.32576847076416 47.61471249278791, -122.32044696807861 47.61471249278791, -122.32044696807861 47.6179525278143, -122.32576847076416 47.6179525278143, -122.32576847076416 47.61471249278791))', 4326), 3857), way) GROUP BY name, way; We first provide the coordinates of the polygon that we transformed. Then we provide the geometry column way from the planet_osm_roads table, which contains the geometries of all the roads on our map. When running the query, we'll find that there are five streets that intersect the polygon. name ----------------------------------- Broadway University Link Northbound East Pine Street University Link Southbound Seattle Streetcar First Hill Line SO MUCH MORE ... In this article, we looked at how to import and use another dataset with our OSM data. In addition, we looked at using PostGIS functions in order to modify our new dataset in order to work with OSM and to select only a portion of that data to query. While we haven't covered all of PostGIS's capabilities, this basic overview will help you start combining your own data with OSM and start using PostGIS and PostgreSQL for all your GIS needs. -------------------------------------------------------------------------------- If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com . We're happy to hear from you. attribution Stephen Monroe Abdullah Alger is a content creator at Compose. Moved from academia to the forefront of cloud technology. Love this article? Head over to Abdullah Alger ’s author page and keep reading.RELATED ARTICLES Mar 16, 2017GEOFILE: USING OPENSTREETMAP DATA IN COMPOSE POSTGRESQL - PART I GeoFile is a series dedicated to looking at geographical data, its features, and uses. In today's article, we're going to int… Abdullah Alger Dec 15, 2016GEOFILE: POSTGIS AND RASTER DATA GeoFile is a series dedicated to looking at geographical data, its features, and uses. In this article, we'll look at raster… Abdullah Alger Oct 17, 2016GEOFILE: EVERYTHING IN THE RADIUS WITH POSTGIS GeoFile is a series dedicated to looking at geographical data, its features and uses. In this article, we build upon our last… Abdullah Alger Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Customer Stories Compose Webinars Support System Status Support & Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ ScyllaDB MySQL Deployments AWS SoftLayer Google Cloud Services Enterprise Add-ons © 2017 Compose, an IBM Company",We'll also look at using PostGIS to filter our data and to find places that are within or intersect a chosen polygon.,GeoFile: Using OpenStreetMap Data in Compose PostgreSQL - Part II,Live,49 129,"Follow Sign in / Sign up Home About Insight Data Science Data Engineering Health Data AI Never miss a story from Insight Data , when you sign up for Medium. Learn more Never miss a story from Insight Data Get updates Get updates Sebastien Dery Blocked Unblock Follow Following I don’t know what I’m doing; but then neither do you so it’s all good. Master of Layers, Protector of the Graph, Wielder of Knowledge. 
#OpenScience Oct 16 -------------------------------------------------------------------------------- GRAPH-BASED MACHINE LEARNING: PART I COMMUNITY DETECTION AT SCALE During the seven-week Insight Data Engineering Fellows Program recent grads and experienced software engineers learn the latest open source technologies by building a data platform to handle large, real-time datasets. Sebastien Dery (now a Data Science Engineer at Yewno ) discusses his project on community detection on large datasets. -------------------------------------------------------------------------------- #tltr : Graph-based machine learning is a powerful tool that can easily be merged into ongoing efforts. Using modularity as an optimization goal provides a principled approach to community detection. Local modularity increment can be tweaked to your own dataset to reflect interpretable quantities. This is useful in many scenarios, making it a prime candidate for your everyday toolbox.Many important problems can be represented and studied using graphs — social networks, interacting bacterias, brain network modules, hierarchical image clustering and many more. If we accept graphs as a basic means of structuring and analyzing data about the world, we shouldn’t be surprised to see them being widely used in Machine Learning as a powerful tool that can enable intuitive properties and power a lot of useful features. Graph-based machine learning is destined to become a resilient piece of logic, transcending a lot of other techniques. See more in this recent blog post from Google Research This post explores the tendencies of nodes in a graph to spontaneously form clusters of internally dense linkage (hereby termed “community”); a remarkable and almost universal property of biological networks. This is particularly interesting knowing that a lot of information can be extrapolated from a node’s neighbor (e.g. think recommendation system, respondent analysis, portfolio clustering). So how can we extract this kind of information? Community Detection aims to partition a graph into clusters of densely connected nodes, with the nodes belonging to different communities being only sparsely connected. Graph analytics concerns itself with the study of nodes (depicted as disks) and their interactions with other nodes (lines). Community Detection aims to classify nodes by their “clique”.“ Is it the same as clustering? ” * Short answer: Yes . * Long answer: For all intents and purposes, yes it is . So why shouldn’t I just use my good old K-Means? You absolutely should, unless your data and requirements don’t work well with that algorithm’s assumptions, namely: 1. K number of clusters 2. Sum of Squared Error (SSE) as the right optimization cost 3. All variable have the same variance 4. The variance of the distribution of each attribute is spherical For a more in-depth look click here . First off, let’s drop this idea of SSE and choose a more relevant notation of what we’re looking for: the internal versus external relationships between nodes of a community. Let’s discuss the notion of modularity. where: nc is the number of communities; lc number of edges within; dc sum of vertex degree; and m the size of the graph (number of edges). We will be using this equation as a global metric of goodness during our search for an optimal partitioning. In a nutshell: Higher score will be given to a community configuration offering higher internal versus external linkage.So all I have to do is optimize this and we’re done, right? 
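The modularity formula referred to just above appears to have been an image that did not survive extraction; only its variable definitions remain. Given those definitions (n_c communities, l_c edges inside community c, d_c the summed degree of its vertices, and m the total number of edges), the standard form it almost certainly corresponds to is

Q = \sum_{c=1}^{n_c} \left[ \frac{l_c}{m} - \left( \frac{d_c}{2m} \right)^{2} \right]

which rewards partitions whose communities contain more internal edges than their degrees alone would predict, matching the "higher internal versus external linkage" criterion described above.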
A major problem in the theoretical formulation of this optimization scheme is that we need an all-knowing knowledge of the graph topology (geometric properties and spatial relations). This is rather, let’s say, intractable . Apparently we can’t do any better than to try all possible subsets of the vertices and check to see which, if any, form communities. The problem of finding the largest clique in a graph is thus said to be NP-hard . However, several algorithms have been proposed over the years to find reasonably good partitions in reasonable amounts of time, each with its own particular flavor. This post focuses on a specific family of algorithms called agglomerative . These algorithms work very simply by collecting (or merging) nodes together. This has a lot of advantages since it typically only requires a knowledge of first degree neighbors and small incremental merging steps , to bring the global solution towards stepwise equilibriums. You might point out that the modularity metric gives a global perspective on the state of the graph and not a local indicator. So, how does this translate to the small local increment that I just mentioned? The basic approach does indeed consists of iteratively merging nodes that optimize a local modularity so let’s go ahead and define that as well: where ∑ in is the sum of weighted links inside C, ∑ tot sum of weighted links incident to nodes in C, k i sum of weighted links incident to node i , k i, in sum of weighted links going from i to nodes in C and m a normalizing factor as the sum of weighted links for the whole graph. (Sorry, Medium doesn’t allow subscript and superscript)This is part of the magic for me as this local optimization function can easily be translated to an interpretable metric within the domain of your graph. For example, * Community Strength: Sum of Weighted Link within a community. * Community Popularity: Sum of Weighted Link incident to nodes within a specific community. * Node Belonging: Sum of Weighted Link from a node to a community. There’s also nothing stopping from adding more terms to the previous equation that are specific to your dataset. In other words, the weighted links can be a function of the type of nodes computed on-the-fly (useful if you’re dealing with a multidimensional graph with various types of relationships and nodes). Example of converging iterations before the Compress phaseNow that we’re all set with our optimization function and local cost, the typical agglomerative strategy consists of two iterative phases ( Transfer and Compress ). Assuming a weighted network of N nodes, we begin by assigning a different community to each node of the network. 1. Transfer : For each node i, consider its neighbors j and evaluate the gain in modularity by swapping c_i for c_j . The greedy process transfers the node into the neighboring community, maximizing the gain in modularity (assuming the gain is positive). If no positive gain is possible, the node i stays in its original community. This process is applied to all nodes until no individual move can improve the modularity (i.e. a local maxima of modularity is attained — a state of equilibrium). 2. Compress : building a new network whose nodes are the communities found during the first phase; a process termed compression (see Figure below). To do so, edge weights between communities are computed as the sum of the internal edges between nodes in the corresponding two communities. Agglomerative process: Phase one converges to a local equilibrium of local modularity. 
Phase two consist in compressing the graph for the next iteration, thus reducing the number of nodes to consider and incidentally computation time as well.Now the tricky part: as this is a greedy algorithm , you’ll have to define a stopping criteria based on your case scenario and the data at hand. How to define this criteria? It can be a lot of things: a maximum number of iterations, a minimum modularity gain during the transfer phase, or any other relevant piece of information related to your data that would inform you that it needs to stop. Still not sure when to stop ? Just make sure you save every intermediate step of the iterative process somewhere, let the optimization run until there’s only one node left in your graph, and then look back at your data! The interesting part is that by keeping track of each step, you also profit from a hierarchical view of your communities which can be further explored and leveraged. In a follow up post, I will discuss how we can achieve this on a distributed system using Spark GraphX , part of my project while at the Insight Data Engineering Fellows Program . [0803.0476] Fast unfolding of communities in large networks Abstract: We propose a simple method to extract the community structure of large networks. Our method is a heuristic… arxiv.org -------------------------------------------------------------------------------- Want to learn Spark, machine learning with graphs, and other big data tools from top data engineers in Silicon Valley or New York? The Insight Data Engineering Fellows Program is a free 7-week professional training where you can build cutting edge big data platforms and transition to a career in data engineering at top teams like Facebook, Uber, Slack and Squarespace. Learn more about the program and apply today . Big Data Data Science Machine Learning Social Network Analysis Insight Data Engineering 4 Blocked Unblock Follow FollowingSEBASTIEN DERY I don’t know what I’m doing; but then neither do you so it’s all good. Master of Layers, Protector of the Graph, Wielder of Knowledge. #OpenScience FollowINSIGHT DATA Insight Fellows Program —Your bridge to careers in Data Science and Data Engineering.",Community Detection at Scale,Graph-based machine learning,Live,50 131,"* Free 7-Day Crash Course * Blog * Masterclass MODERN MACHINE LEARNING ALGORITHMS: STRENGTHS AND WEAKNESSES EliteDataScience 0 Comments May 16, 2017 Share Google Linkedin TweetIn this guide, we’ll take a practical, concise tour through modern machine learning algorithms. While other such lists exist, they don’t really explain the practical tradeoffs of each algorithm, which we hope to do here. We’ll discuss the advantages and disadvantages of each algorithm based on our experience. Categorizing machine learning algorithms is tricky, and there are several reasonable approaches; they can be grouped into generative/discriminative, parametric/non-parametric, supervised/unsupervised, and so on. For example, Scikit-Learn’s documentation page groups algorithms by their learning mechanism . This produces categories such as: * Generalized linear models * Support vector machines * Nearest neighbors * Decision trees * Neural networks * And so on… However, from our experience, this isn’t always the most practical way to group algorithms. That’s because for applied machine learning, you’re usually not thinking, “boy do I want to train a support vector machine today!” Instead, you usually have an end goal in mind, such as predicting an outcome or classifying your observations. 
Therefore, we want to introduce another approach to categorizing algorithms, which is by machine learning task. NO FREE LUNCH In machine learning, there’s something called the “No Free Lunch” theorem. In a nutshell, it states that no one algorithm works best for every problem, and it’s especially relevant for supervised learning (i.e. predictive modeling). For example, you can’t say that neural networks are always better than decision trees or vice-versa. There are many factors at play, such as the size and structure of your dataset. As a result, you should try many different algorithms for your problem , while using a hold-out “test set” of data to evaluate performance and select the winner. Of course, the algorithms you try must be appropriate for your problem, which is where picking the right machine learning task comes in. As an analogy, if you need to clean your house, you might use a vacuum, a broom, or a mop, but you wouldn't bust out a shovel and start digging. MACHINE LEARNING TASKS This is Part 1 of this series. In this part, we will cover the ""Big 3"" machine learning tasks, which are by far the most common ones. They are: 1. Regression 2. Classification 3. Clustering In Part 2 (coming soon), we will cover more situational tasks, such as: 1. Feature Selection 2. Feature Extraction 3. Density Estimation 4. Anomaly Detection Two notes before continuing: * We will not cover domain-specific adaptations, such as natural language processing. * We will not cover every algorithm. There are too many to list, and new ones pop up all the time. However, this list will give you a representative overview of successful contemporary algorithms for each task. 1. REGRESSION Regression is the supervised learning task for modeling and predicting continuous, numeric variables. Examples include predicting real-estate prices, stock price movements, or student test scores. Regression tasks are characterized by labeled datasets that have a numeric target variable . In other words, you have some ""ground truth"" value for each observation that you can use to supervise your algorithm. Linear Regression 1.1. (REGULARIZED) LINEAR REGRESSION Linear regression is one of the most common algorithms for the regression task. In its simplest form, it attempts to fit a straight hyperplane to your dataset (i.e. a straight line when you only have 2 variables). As you might guess, it works well when there are linear relationships between the variables in your dataset. In practice, simple linear regression is often outclassed by its regularized counterparts (LASSO, Ridge, and Elastic-Net). Regularization is a technique for penalizing large coefficients in order to avoid overfitting , and the strength of the penalty should be tuned. * Strengths: Linear regression is straightforward to understand and explain, and can be regularized to avoid overfitting. In addition, linear models can be updated easily with new data using stochastic gradient descent . * Weaknesses: Linear regression performs poorly when there are non-linear relationships. They are not naturally flexible enough to capture more complex patterns, and adding the right interaction terms or polynomials can be tricky and time-consuming. * Implementations: Python / R 1.2. REGRESSION TREE (ENSEMBLES) Regression trees (a.k.a. decision trees) learn in a hierarchical fashion by repeatedly splitting your dataset into separate branches that maximize the information gain of each split. 
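To make the regression discussion concrete, here is a minimal sketch in scikit-learn (the Python implementation these sections link to) contrasting a regularized linear model with a tree ensemble on a toy non-linear dataset; the dataset, hyperparameters, and scores are purely illustrative assumptions, not results from this article.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Toy data with a non-linear relationship (illustrative only).
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X).ravel() + 0.1 * rng.randn(500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Regularized linear regression: alpha controls the penalty strength.
ridge = Ridge(alpha=1.0).fit(X_train, y_train)

# Tree ensemble: often strong out-of-the-box on non-linear relationships.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# A hold-out test set, as recommended above, is what actually decides the winner.
print("Ridge R^2:        ", r2_score(y_test, ridge.predict(X_test)))
print("Random forest R^2:", r2_score(y_test, forest.predict(X_test)))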
This branching structure allows regression trees to naturally learn non-linear relationships. Ensemble methods, such as Random Forests (RF) and Gradient Boosted Trees (GBM), combine predictions from many individual trees. We won't go into their underlying mechanics here, but in practice, RF's often perform very well out-of-the-box while GBM's are harder to tune but tend to have higher performance ceilings. * Strengths: Decision trees can learn non-linear relationships, and are fairly robust to outliers. Ensembles perform very well in practice, winning many classical (i.e. non-deep-learning) machine learning competitions. * Weaknesses: Unconstrained, individual trees are prone to overfitting because they can keep branching until they memorize the training data. However, this can be alleviated by using ensembles. * Implementations: Random Forest - Python / R , Gradient Boosted Tree - Python / R 1.3. DEEP LEARNING Deep learning refers to multi-layer neural networks that can learn extremely complex patterns. They use ""hidden layers"" between inputs and outputs in order to model intermediary representations of the data that other algorithms cannot easily learn. They have several important mechanisms, such as convolutions and drop-out, that allows them to efficiently learn from high-dimensional data. However, deep learning still requires much more data to train compared to other algorithms because the models have orders of magnitudes more parameters to estimate. * Strengths: Deep learning is the current state-of-the-art for certain domains, such as computer vision and speech recognition. Deep neural networks perform very well on image, audio, and text data, and they can be easily updated with new data using batch propagation. Their architectures (i.e. number and structure of layers) can be adapted to many types of problems, and their hidden layers reduce the need for feature engineering. * Weaknesses: Deep learning algorithms are usually not suitable as general-purpose algorithms because they require a very large amount of data. In fact, they are usually outperformed by tree ensembles for classical machine learning problems. In addition, they are computationally intensive to train, and they require much more expertise to tune (i.e. set the architecture and hyperparameters). * Implementations: Python / R 1.4. HONORABLE MENTION: NEAREST NEIGHBORS Nearest neighbors algorithms are ""instance-based,"" which means that that save each training observation. They then make predictions for new observations by searching for the most similar training observations and pooling their values. These algorithms are memory-intensive, perform poorly for high-dimensional data, and require a meaningful distance function to calculate similarity. In practice, training regularized regression or tree ensembles are almost always better uses of your time. 2. CLASSIFICATION Classification is the supervised learning task for modeling and predicting categorical variables. Examples include predicting employee churn, email spam, financial fraud, or student letter grades. As you'll see, many regression algorithms have classification counterparts. The algorithms are adapted to predict a class (or class probabilities) instead of real numbers. Logistic Regression 2.1. (REGULARIZED) LOGISTIC REGRESSION Logistic regression is the classification counterpart to linear regression. Predictions are mapped to be between 0 and 1 through the logistic function , which means that predictions can be interpreted as class probabilities. 
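As a quick, hedged illustration of that probabilistic output, here is a minimal sketch using scikit-learn (the Python implementation this section links to); the synthetic data and the C value are arbitrary choices made for the example.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# C is the inverse of the regularization strength: smaller C means a stronger penalty.
clf = LogisticRegression(C=1.0).fit(X, y)

# predict_proba returns the class probabilities described above,
# i.e. values squashed into [0, 1] by the logistic function.
print(clf.predict_proba(X[:3]))
print(clf.predict(X[:3]))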
The models themselves are still ""linear,"" so they work well when your classes are linearly separable (i.e. they can be separated by a single decision surface). Logistic regression can also be regularized by penalizing coefficients with a tunable penalty strength. * Strengths: Outputs have a nice probabilistic interpretation, and the algorithm can be regularized to avoid overfitting. Logistic models can be updated easily with new data using stochastic gradient descent. * Weaknesses: Logistic regression tends to underperform when there are multiple or non-linear decision boundaries. They are not flexible enough to naturally capture more complex relationships. * Implementations: Python / R 2.2. CLASSIFICATION TREE (ENSEMBLES) Classification trees are the classification counterparts to regression trees. They are both commonly referred to as ""decision trees"" or by the umbrella term ""classification and regression trees (CART)."" * Strengths: As with regression, classification tree ensembles also perform very well in practice. They are robust to outliers, scalable, and able to naturally model non-linear decision boundaries thanks to their hierarchical structure. * Weaknesses: Unconstrained, individual trees are prone to overfitting, but this can be alleviated by ensemble methods. * Implementations: Random Forest - Python / R , Gradient Boosted Tree - Python / R 2.3. DEEP LEARNING To continue the trend, deep learning is also easily adapted to classification problems. In fact, classification is often the more common use of deep learning, such as in image classification. * Strengths: Deep learning performs very well when classifying for audio, text, and image data. * Weaknesses: As with regression, deep neural networks require very large amounts of data to train, so it's not treated as a general-purpose algorithm. * Implementations: Python / R 2.4. SUPPORT VECTOR MACHINES Support vector machines (SVM) use a mechanism called kernels , which essentially calculate distance between two observations. The SVM algorithm then finds a decision boundary that maximizes the distance between the closest members of separate classes. For example, an SVM with a linear kernel is similar to logistic regression. Therefore, in practice, the benefit of SVM's typically comes from using non-linear kernels to model non-linear decision boundaries. * Strengths: SVM's can model non-linear decision boundaries, and there are many kernels to choose from. They are also fairly robust against overfitting, especially in high-dimensional space. * Weaknesses: However, SVM's are memory intensive, trickier to tune due to the importance of picking the right kernel, and don't scale well to larger datasets. Currently in the industry, random forests are usually preferred over SVM's. * Implementations: Python / R 2.5. NAIVE BAYES Naive Bayes (NB) is a very simple algorithm based around conditional probability and counting. Essentially, your model is actually a probability table that gets updated through your training data. To predict a new observation, you'd simply ""look up"" the class probabilities in your ""probability table"" based on its feature values. It's called ""naive"" because its core assumption of conditional independence (i.e. all input features are independent from one another) rarely holds true in the real world. * Strengths: Even though the conditional independence assumption rarely holds true, NB models actually perform surprisingly well in practice, especially for how simple they are. 
They are easy to implement and can scale with your dataset. * Weaknesses: Due to their sheer simplicity, NB models are often beaten by models properly trained and tuned using the previous algorithms listed. * Implementations: Python / R 3. CLUSTERING Clustering is an unsupervised learning task for finding natural groupings of observations (i.e. clusters) based on the inherent structure within your dataset. Examples include customer segmentation, grouping similar items in e-commerce, and social network analysis. Because clustering is unsupervised (i.e. there's no ""right answer""), data visualization is usually used to evaluate results. If there is a ""right answer"" (i.e. you have pre-labeled clusters in your training set), then classification algorithms are typically more appropriate. K-Means 3.1. K-MEANS K-Means is a general purpose algorithm that makes clusters based on geometric distances (i.e. distance on a coordinate plane) between points. The clusters are grouped around centroids, causing them to be globular and have similar sizes. This is our recommended algorithm for beginners because it's simple, yet flexible enough to get reasonable results for most problems. * Strengths: K-Means is hands-down the most popular clustering algorithm because it's fast, simple, and surprisingly flexible if you pre-process your data and engineer useful features. * Weaknesses: The user must specify the number of clusters, which won't always be easy to do. In addition, if the true underlying clusters in your data are not globular, then K-Means will produce poor clusters. * Implementations: Python / R 3.2. AFFINITY PROPAGATION Affinity Propagation is a relatively new clustering technique that makes clusters based on graph distances between points. The clusters tend to be smaller and have uneven sizes. * Strengths: The user doesn't need to specify the number of clusters (but does need to specify 'sample preference' and 'damping' hyperparameters). * Weaknesses: The main disadvantage of Affinity Propagation is that it's quite slow and memory-heavy, making it difficult to scale to larger datasets. In addition, it also assumes the true underlying clusters are globular. * Implementations: Python / R 3.3. HIERARCHICAL / AGGLOMERATIVE Hierarchical clustering, a.k.a. agglomerative clustering, is a suite of algorithms based on the same idea: (1) Start with each point in its own cluster. (2) For each cluster, merge it with another based on some criterion. (3) Repeat until only one cluster remains and you are left with a hierarchy of clusters. * Strengths: The main advantage of hierarchical clustering is that the clusters are not assumed to be globular. In addition, it scales well to larger datasets. * Weaknesses: Much like K-Means, the user must choose the number of clusters (i.e. the level of the hierarchy to ""keep"" after the algorithm completes). * Implementations: Python / R 3.4. DBSCAN DBSCAN is a density based algorithm that makes clusters for dense regions of points. There's also a recent new development called HDBSCAN that allows varying density clusters. * Strengths: DBSCAN does not assume globular clusters, and its performance is scalable. In addition, it doesn't require every point to be assigned to a cluster, reducing the noise of the clusters (this may be a weakness, depending on your use case). * Weaknesses: The user must tune the hyperparameters 'epsilon' and 'min_samples,' which define the density of clusters. DBSCAN is quite sensitive to these hyperparameters. 
* Implementations: Python / R PARTING WORDS We've just taken a whirlwind tour through modern algorithms for the ""Big 3"" machine learning tasks: Regression, Classification, and Clustering. In Part 2 (coming soon), we will look at algorithms for more situational tasks, such as Dimensionality Reduction (i.e. Feature Selection or Extraction), Density Estimation, and Anomaly Detection. However, we want to leave you with a few words of advice based on our experience: 1. First... practice, practice, practice. Reading about algorithms can help you find your footing at the start, but true mastery comes with practice. As you work through projects and/or competitions, you'll develop practical intuition, which unlocks the ability to pick up almost any algorithm and apply it effectively. 2. Second... master the fundamentals. There are dozens of algorithms we couldn't list here, and some of them can be quite effective in specific situations. However, almost all of them are some adaptation of the algorithms on this list, which will provide you a strong foundation for applied machine learning. 3. Finally, remember that better data beats fancier algorithms. In applied machine learning, algorithms are commodities because you can easily switch them in and out depending on the problem. However, effective exploratory analysis, data cleaning, and feature engineering can significantly boost your results. If you'd like to learn more about the applied machine learning workflow and how to efficiently train professional-grade models, we invite you to sign up for our free 7-day email crash course . For more over-the-shoulder guidance, we also offer a comprehensive masterclass that further explains the intuition behind many of these algorithms and teaches you how to apply them to real-world problems. Share Google Linkedin TweetLEAVE A RESPONSE CANCEL REPLY Name* Email* Website* Denotes Required Field RECOMMENDED READING * Modern Machine Learning Algorithms: Strengths and Weaknesses * The Ultimate Python Seaborn Tutorial: Gotta Catch ‘Em All * The 5 Levels of Machine Learning Iteration * R vs Python for Data Science: Summary of Modern Advances * Python Machine Learning Tutorial, Scikit-Learn: Wine Snob Edition Copyright © 2016 · EliteDataScience.com · All Rights Reserved * Home * Terms of Service * Privacy Policy","Get to know the ML landscape through this practical, concise overview of modern machine learning algorithms. Plus, we'll discuss the tradeoffs of each.",Modern Machine Learning Algorithms,Live,51 132,"* United States IBM® * Site map Search IBM Developer Advocacy * Services * Set Up a Secure Gateway * Cloudant * Migrate CSV data to dashDB * Migrate PureData for Analytics Data to dashDB * Migrate Data with the Lift Data Load API * Compose * Spark * dashDB * IBM Graph * Data Connect * Lift * BigInsights on Cloud * Watson Analytics * DB2 on Cloud * DataStage on Cloud * Master Data Management on Cloud * Informix on Cloud * Blog * Showcases * Search Resources * Events Services to get , build , and analyze data on the ibm cloud Set Up a Secure GatewayLearn how to set up a secure gateway as the first step to migrating your data to dashDB using IBM Bluemix Lift. You can also… CloudantA fully-managed NoSQL database as a service (DBaaS) built from the ground up to scale globally, run non-stop, and handle a wide variety of data… Migrate CSV data to dashDBLearn how to migrate your CSV data to dashDB using IBM Bluemix Lift. 
You can also read a transcript of this video Read the migration… Migrate PureData for Analytics Data to dashDBLearn how to migrate data from IBM PureData for Analytics to dashDB using IBM Bluemix Lift. You can also read a transcript of this video… Migrate Data with the Lift Data Load APIThe IBM Bluemix Lift Data Load API allows you to perform your migration from on-premises sources to targets on the cloud. The IBM Bluemix Lift… ComposeProduction-ready hosting for the following databases: MongoDB with SSL, Elasticsearch, RethinkDB, PostgreSQL, Redis, etcd, and RabbitMQ. SparkAnalytics for Apache Spark provides fast, in-memory, distributed analytics processing of large data sets. dashDBTrue business intelligence comes from the ability to glean insights from your data. To get them, you need a place where you can combine data… IBM GraphIBM Graph is an easy-to-use, fully managed graph database service for storing, querying, and visualizing data points, their connections, and properties. IBM Graph is based… Data ConnectData Connect is a cloud-based data refinery that transforms raw data into relevant and actionable information. Find data, shape it, and deliver it to applications… LiftMigrate data from on-premises to the cloud quickly and securely. BigInsights on CloudIBM BigInsights on Cloud provides Hadoop-as-a-service on IBM’s SoftLayer global cloud infrastructure. It offers the performance and security of an on-premises deployment without the cost… Watson AnalyticsWatson Analytics offers you the benefits of advanced analytics without the complexity. A smart data discovery service available on the cloud, it guides data exploration,… DB2 on CloudIBM DB2 on Cloud offering provides a database on IBM’s SoftLayer® global cloud infrastructure. It offers customers the rich features of an on-premise DB2 deployment… DataStage on CloudIBM DataStage on Cloud provides IBM InfoSphere DataStage on the IBM SoftLayer global cloud infrastructure. It offers the rich features of the on-premises DataStage deployment… Master Data Management on CloudIBM Master Data Management on Cloud provides IBM Master Data Management Advanced Edition on IBM Softlayer global cloud infrastructure. It offers the rich features of… Informix on CloudToday’s businesses are embracing the virtualization and automation of cloud computing to decrease costs and increase the deliverables of their IT departments. The time-tested characteristics… Search Topic Advanced Search Language Technology Powered by the Simple Search Service i What's This?The most popular Topics, Technologies and Languages are determined by the Simple Search Service - a microservice that lets you quickly create a faceted search engine. See what else IBM can do for you. Learn More about the Simple Search Service CloudDataServices Labs Open Menu * * Services * Back to Navigation * Watson Analytics * Migrate Data with the Lift Data Load API * Informix on Cloud * Migrate PureData for Analytics Data to dashDB * Set Up a Secure Gateway * Blog * Showcases * Search resources * Back to Navigation * Events NEW VIDEOS! HOW TO BUILD AN APP USING IBM GRAPH -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Lauren Schaefer 12/14/16Lauren Schaefer Learn More Recent Posts * New videos! How to build an app using IBM Graph Watch how to build a storefront web app with IBM Graph. * What’s all the hoopla about graph databases? 
Learn why you'd want to use a Graph database and see how to get started. What do you do when you need to take a simple, static website and turn it into an online storefront with personalized recommendations? You create a new graph database using IBM Graph, and you start coding! Over the last few months, I’ve been doing just that. While I’ve been busy coding, I’ve been documenting my progress in videos. So, sit back, relax, and enjoy my video playlist! If you’d like to try my demo app yourself, visit http://laurenslovelylandscapegraph.mybluemix.net . You can get your own copy of the code here . Or better yet, you can deploy the app to Bluemix with the simple click of a button so you can have your own running copy of the app: I’m currently building the recommendation engine for the app. Follow me on Twitter for updates: @Lauren_Schaefer . Happy graphing! * Graph",Watch how to build a storefront web app with IBM Graph.,Build an app using IBM Graph,Live,52 135,"Jake Shelley, PM on IBM Watson Data Platform, Nov 15 -------------------------------------------------------------------------------- INTRODUCING STREAMS DESIGNER Starting today, users will be able to access Streams Designer through the Watson Data Platform. Streams Designer is a brand new IDE for building applications using real time data. WHAT IS STREAMS DESIGNER? Building real-time applications can be intimidating. Streams Designer makes the process easy and accessible by allowing you to simply drag and drop operators to shape, model, and transform your data as it flows from inputs to outputs. Streams Designer will allow new users to get their feet wet building real-time applications without having to dive deep into complex libraries and tools. Existing users will also love how quickly they can build and test new flows. WHAT’S NEW IN THE SERVICE? Here are a couple of highlights of the functionality being delivered in Streams Designer. * Drag and drop interface for real-time applications: Streams Designer promises to make real-time analysis more accessible. You can drag and drop operators onto a canvas and connect them to create a pipeline for your data to flow through. Streams Designer offers a drag and drop IDE to create real-time applications * Monitor your flow in real time: While your flow is running, Streams Designer provides a dashboard for you to monitor the throughput of events as they pass through operators. You can also see the events and their attributes as they pass from operator to operator. You can quickly determine the health and status of your flow without having to check outputs and logs. Monitor the health and status of your flow in the real-time dashboard * Handle common streaming use cases with a constantly growing list of operators: Today you can create flows that leverage models, filter by geofences, and aggregate clickstream data. Use the getting started wizard to set up a flow using a template. The team is continuously working on new operators and use cases, so if you don’t see what you need today, let us know and we’ll get on it!
-------------------------------------------------------------------------------- HOW DO I GET STARTED? Getting started is easy and free . If you don’t have a Watson Data Platform account, sign up here . After you finish registering, select Streams Designer from the Tools menu or add it to an existing project. The team is incredibly excited to open up Streams Designer to a wider audience. We’ve come a long way, but there is a lot more coming! Look for updates in this blog. If you’d like more information about IBM Streaming Analytics you can find it here . * Real Time Analytics * Streaming Analytics * IBM One clap, two clap, three clap, forty?By clapping more or less, you can signal to us which stories really stand out. Blocked Unblock Follow FollowingJAKE SHELLEY PM on IBM Watson Data Platform FollowIBM WATSON DATA PLATFORM Build smarter applications and quickly visualize, share, and gain insights * * * * Never miss a story from IBM Watson Data Platform , when you sign up for Medium. Learn more Never miss a story from IBM Watson Data Platform Get updates Get updates","Starting today, users will be able to access Streams Designer through the Watson Data Platform. Streams Designer is a brand new IDE for building applications using real time data. ",Introducing Streams Designer,Live,53 138,"Jump to navigation * Twitter * LinkedIn * Facebook * About * Contact * Content By Type * Blogs * Videos * All Videos * IBM Big Data In A Minute * Video Chats * Analytics Video Chats * Big Data Bytes * Big Data Developers Streaming Meetups * Cyber Beat Live * Podcasts * White Papers & Reports * Infographics & Animations * Presentations * Galleries * Subscribe ×BLOGS 8 WAYS TO TURN DATA INTO VALUE WITH APACHE SPARK MACHINE LEARNING Post Comment October 18, 2016 by Alex Liu Chief Data Scientist, Analytics Services, IBM Follow me on LinkedInEven as Apache Spark becomes increasingly easy to use, it is also becoming organizations’ go-to solution for executing big data computations. Not surprisingly, then, more companies than ever are adopting Spark. BUILDING AN ANALYTICS OPERATING SYSTEM When Databricks looked into 900 organizations’ use of Apache Spark in July 2016, an even clearer picture emerged. Spark played an essential role in building real-time streaming use cases for more than half (51%) of respondents, and 82% said the same when asked about advanced analytics. Similarly, use of Spark’s machine learning capabilities for production purposes jumped from 13% in 2015 to 18% in 2016. Within the computing community, increasing numbers of corporations, IBM among them, have helped enhance the capabilities of Spark. In particular, IBM backs Spark as the “analytics operating system” and accordingly has become one of the top contributors to Spark 2.0.0, as well as one of the biggest contributors to Spark’s machine learning capabilities . Data compiled by the IBM WW Competitive and Product Strategy Team. In the wake of much favorable media attention paid to Spark, many corporations have adopted Spark on paper—or have at least downloaded it with an eye to future use. Yet only a fraction have actually used Spark, let alone implemented it as their core analytics platform. TURNING DATA INTO VALUE THROUGH MACHINE LEARNING In the modern business environment, implementation of any platform, Apache Spark or not, requires practical justifications. Accordingly, the foundation for any serious Spark adoption is, as always, Spark’s power to turn data into value. 
Drawing on my own consulting experience as well as on some of my own research , I’ll share eight ways of using Spark’s machine learning capabilities to turn data into value. 1. OBTAIN A HOLISTIC VIEW OF BUSINESS In today's competitive world, many corporations work hard to gain a holistic view or a 360 degree view of customers, for many of the key benefits as outlined by data analytics expert Mr. Abhishek Joshi . In many cases, a holistic view was not obtained, partially due to the lack of capabilities to organize huge amount of data and then to analyze them. But Apache Spark’s ability to compute quickly while using data frames to organize huge amounts of data can help researchers quickly develop analytical models that provide a holistic view of the business, adding value to related business operations. To realize this value, however, an analytical process, from data cleaning to modeling, must still be completed. 2. ENHANCE FRAUD DETECTION WITH TIMELY UPDATES To avoid losing millions or even billions of dollars to the ever-changing fraudulent schemes that plague the modern financial landscape, banks must use fraud detection models that let them quickly adopt new data and update their models accordingly. The machine learning capabilities offered by Apache Spark can help make this possible. 3. USE HUGE AMOUNTS OF DATA TO ENHANCE RISK SCORING For financial organizations, even tiny improvements to risk scoring can bring huge profits merely by avoiding defaults. In particular, the addition of data can help heighten the accuracy of risk scoring, allowing financial institutions to predict default. Although adding data can be a very challenging prospect from the standpoint of traditional credit scoring, Apache Spark can simplify the risk scoring process. 4. AVOID CUSTOMER CHURN BY RETHINKING CHURN MODELING Losing customers means losing revenue. Not surprisingly, then, companies strive to detect potential customer churn through predictive modeling, allowing them to implement interventions aimed at retaining customers. This might sound easy, but it can actually be very complicated: Customers leave for reasons that are as divergent as the customers themselves are, and products and services can play an important, but hidden, role in all this. What’s more, merely building models to predict churn for different customer segments—and with regard to different products and services—isn’t enough; we must also design interventions, then select the intervention judged most likely to prevent a particular customer from departing. Yet even doing this requires the use of analytics to evaluate the results achieved—and, eventually, to select interventions from an analytical standpoint. Amid this morass of choices, Apache Spark’s distributed computing capabilities can help solve previously baffling problems. 5. DEVELOP MEANINGFUL PURCHASE RECOMMENDATIONS Recommendations for purchases of products and services can be very powerful when made appropriately, and they have become expected features of e-commerce platforms, with many customers relying on recommendations to guide their purchases. Yet developing recommendations at all means developing recommendations for each customer—or, at the very least, for small segments of customers. Apache Spark can make this possible by offering the distributed computing and streaming analytics capabilities that have become invaluable tools for this purpose. 6. 
DRIVE LEARNING BY AVOIDING STUDENT ATTRITION AND PERSONALIZING LEARNING Big data is no longer solely the province of business—it has come to play a central role in education, particularly as universities seek to combat student churn, including by providing personalized education. In the modern educational environment, a combination of Apache Spark–based student churn modeling and recommendation systems can add significant value, both material and nonmaterial, to educational institutions. 7. HELP CITIES MAKE DATA-DRIVEN DECISIONS Pursuant to laws and regulations enacted at various levels of government, US cities are increasingly making their collected data publicly available—the data.gov portal is a well-known example. Certainly, as seen in New York , the open data thus disseminated is an important enabler of data-driven decision making at the municipal level. But US cities are only just beginning to generate value in this way, partly because of the difficulties of organizing this mass of data in easily used forms and the challenge of applying suitable predictive models. However, as we’ve already observed in open data meetups, including an IBM-sponsored meetup in Glendale , Apache Spark and other open-source tools, such as R, are indeed helping municipalities derive increasing value from open data. 8. PRODUCE SUITABLE CUSTOMER SEGMENTATIONS USING TELECOMMUNICATIONS DATA Many giant telecommunications companies, in the United States as well as around the world, have collected huge amounts of data, some of which they make available to their partners and customers. But using this data to create value often remains a significant challenge: The data is stored using special formats and chiefly comprises text, not numeric, information—and that’s apart from any special data issues that may arise, including those involving missing cases or missing content. Fortunately, Apache Spark, when used together with R and IBM SPSS, can help companies work effectively with special data formats while handling special data issues and providing modeling algorithms suited for work with both numbers and text—bringing software solutions together to offer additional ways of creating value. For more information about these ways of using Apache Spark, including detailed plans of action, check out my book Apache Spark Machine Learning Blueprints , available on Amazon. Reflecting IBM’s focus on Apache Spark, the machine learning capabilities of Apache Spark will be a main focus at the IBM Insight at World of Watson 2016 conference , scheduled for 24–27 October in Las Vegas. I hope to see you there, where I’ll be joining my colleagues. Look out for me at select events and in the IBM bookstore for a chance to meet up at one of my book signings. Follow @IBMBigData Topics: Analytics , Big Data Education , Big Data Use Cases , Data Scientists , Hadoop Tags: Apache Spark , churn , counterfraud , data analytics , data science , e-commerce , education , Finance , fraud , IBM SPSS , machine learning , Public Sector , R , risk , segmentation , telecommunicationsRELATED CONTENT PODCAST DATA SCIENCE EXPERT INTERVIEW: DEZ BLANCHFIELD, CRAIG BROWN, DAVID MATHISON, JENNIFER SHIN AND MIKE TAMIR PART 2 Take a peek at the future of data science in this discussion with five thought leaders in the data analytics industry, the second installment of a two-part interview recorded at the IBM Insight at World of Watson 2016 conference. 
Listen to Podcast Podcast Data science expert interview: Dez Blanchfield, Craig Brown, David Mathison, Jennifer Shin and Mike Tamir part 1 Blog Calling all TM1 users: Your next on-premises planning solution is here Video Dez Blanchfield's predictions based on what he learned at World of Watson 2016 Podcast Cyber Beat Live: Can analytics and cognitive computing stop cyber criminals? Blog Accessing the power of R through a robust statistical analysis tool Podcast Finance in Focus: Meet Watson—your new surveillance officer Video Insurers: Isn't it time to go beyond traditional views of policyholders relations? Video IBM Incentive Compensation Management: Improve sales results and operational efficiencies Blog The cognitive level of surveillance for financial institutions Video Dez Blanchfield's top 3 takeaways from World of Watson 2016 Video Recommender System with Elasticsearch: Nick Pentreath & Jean-François Puget Video Hyperparameter optimization: Sven Hafeneger View the discussion thread. IBM * Site Map * Privacy * Terms of Use * 2014 IBM FOLLOW IBM BIG DATA & ANALYTICS * Facebook * YouTube * Twitter * @IBMbigdata * LinkedIn * Google+ * SlideShare * Twitter * @IBManalytics * Explore By Topic * Use Cases * Industries * Analytics * Technology * For Developers * Big Data & Analytics Heroes * Explore By Content Type * Blogs * Videos * Analytics Video Chats * Big Data Bytes * Big Data Developers Streaming Meetups * Cyber Beat Live * Podcasts * White Papers & Reports * Infographics & Animations * Presentations * Galleries * Events * Around the Web * About The Big Data & Analytics Hub * Contact Us * RSS Feeds * Additional Big Data Resources * AnalyticsZone * Big Data University * Channel Big Data * developerWorks Big Data Community * IBM big data for the enterprise * IBM Data Magazine * Smarter Questions Blog * Events * Upcoming Events * Webcasts * Twitter Chats * Meetups * Around the Web * For Developers * Big Data & Analytics Heroes More * Events * Upcoming Events * Webcasts * Twitter Chats * Meetups * Around the Web * For Developers * Big Data & Analytics Heroes SearchEXPLORE BY TOPIC: Use Cases All Acquire Grow & Retain Customers Create New Business Models Improve IT Economics Manage Risk Optimize Operations & Reduce Fraud Transform Financial Processes Industries All Automotive Banking Consumer Products Education Electronics Energy & Utilities Government Healthcare & Life Sciences Industrial Insurance Media & Entertainment Retail Telecommunications Travel & Transportation Wealth Management Analytics All Sales Performance Management Content Analytics Customer Analytics Entity Analytics Financial Performance Management Insight Services Social Media Analytics Technology All Business Intelligence Cloud Database Data Warehouse Database Management Systems Data Governance Data Science Hadoop & Spark Internet of Things Predictive Analytics Streaming Analytics Blog Internet of Things: A continuum of change with opportunities galore Blog Quest for value: Entering a new era of pragmatism for data and analytics Presentation Calling all IBM TM1 users! There’s a new on-premises solution in town Podcast The unusual suspects in cyber warfareMORE Blog Internet of Things: A continuum of change with opportunities galore Blog Quest for value: Entering a new era of pragmatism for data and analytics Presentation Calling all IBM TM1 users! 
There’s a new on-premises solution in town Podcast The unusual suspects in cyber warfare Blog Calling all TM1 users: Your next on-premises planning solution is here Presentation 8 innovative ideas for data architects Video Dez Blanchfield's predictions based on what he learned at World of Watson 2016 Blog Internet of Things: A continuum of change with opportunities galore Presentation Calling all IBM TM1 users! There’s a new on-premises solution in town Blog Calling all TM1 users: Your next on-premises planning solution is here Presentation 8 innovative ideas for data architectsMORE Blog Internet of Things: A continuum of change with opportunities galore Presentation Calling all IBM TM1 users! There’s a new on-premises solution in town Blog Calling all TM1 users: Your next on-premises planning solution is here Presentation 8 innovative ideas for data architects Blog Accessing the power of R through a robust statistical analysis tool Video Insurers: Isn't it time to go beyond traditional views of policyholders relations? Video IBM Incentive Compensation Management: Improve sales results and operational efficiencies Podcast The unusual suspects in cyber warfare Podcast Cyber Beat Live: Can analytics and cognitive computing stop cyber criminals? Podcast Finance in Focus: Meet Watson—your new surveillance officer Video Insurers: Isn't it time to go beyond traditional views of policyholders relations?MORE Podcast The unusual suspects in cyber warfare Podcast Cyber Beat Live: Can analytics and cognitive computing stop cyber criminals? Podcast Finance in Focus: Meet Watson—your new surveillance officer Video Insurers: Isn't it time to go beyond traditional views of policyholders relations? Blog The cognitive level of surveillance for financial institutions Blog Dynamic duo: Big data and design thinking Video Data streams in telecom: Koen Dejonghe Blog Internet of Things: A continuum of change with opportunities galore Blog Quest for value: Entering a new era of pragmatism for data and analytics Podcast Data science expert interview: Dez Blanchfield, Craig Brown, David Mathison, Jennifer Shin and Mike... Podcast Data science expert interview: Dez Blanchfield, Craig Brown, David Mathison, Jennifer Shin and Mike...MORE Blog Internet of Things: A continuum of change with opportunities galore Blog Quest for value: Entering a new era of pragmatism for data and analytics Podcast Data science expert interview: Dez Blanchfield, Craig Brown, David Mathison, Jennifer Shin and Mike... Podcast Data science expert interview: Dez Blanchfield, Craig Brown, David Mathison, Jennifer Shin and Mike... 
Presentation 8 innovative ideas for data architects Video Dez Blanchfield's predictions based on what he learned at World of Watson 2016 Blog Accessing the power of R through a robust statistical analysis tool * Home * Explore By Topic * Use Cases * All * Acquire, Grow & Retain Customers * Create New Business Models * Improve IT Economics * Manage Risk * Optimize Operations & Reduce Fraud * Transform Financial Processes * Industries * All * Banking * Consumer Products * Education * Energy & Utilities * Government * Healthcare & Life Sciences * Industrial * Insurance * Media & Entertainment * Retail * Telecommunications * Analytics * All * Content Analytics * Customer Analytics * Entity Analytics * Social Media Analytics * Technology * All * Business Intelligence * Cloud Database * Data Governance * Data Warehouse * Database Management Systems * Data Science * Hadoop & Spark * Internet of Things * Predictive Analytics * Streaming Analytics * Content By Type * Blogs * Videos * All Videos * IBM Big Data In A Minute * Video Chat * Analytics Video Chats * Big Data Bytes * Big Data Developers Streaming Meetups * Cyber Beat Live * Podcasts * White Papers & Reports * Infographics & Animations * Presentations * Galleries * Big Data & Analytics Heroes * For Developers * Events * Upcoming Events * Webcasts * Twitter Chat * Meetups * Around The Web * About Us * Contact Us * Search Site",Discover eight ways that Apache Spark’s machine learning capabilities are driving the modern business.,8 ways to turn data into value with Apache Spark machine learning,Live,54 141,"PREDICT FLIGHT DELAYS WITH APACHE SPARK MLLIB, FLIGHTSTATS, AND WEATHER DATA David Taieb / August 4, 2016Flight delays are an inconvenience. Wouldn’t it be great to predict how likely a flight is to be delayed? You could remove uncertainty and let travelers plan ahead. Usually, the weather is to blame for delays. So I’ve crafted an analytics solution based on weather data and past flight performance. This solution takes weather data from IBM Insights for Weather and combines it with flight history from flightstats.com to build a predictive model that can forecast delays. To load and combine all this data, we use our Simple Data Pipe open source tool to move it into a NoSQL Cloudant database. Then I use Spark MLLib to train predictive models using supervised learning algorithms and cross-validate them. ABOUT PREDICTIVE MODELING To create a solution that can make accurate predictions, we need to tease meaningful information out of our data to craft a predictive model that can make guesses about future events. We do this using our historical weather and flight data, which we divvy up into 3 parts: * the training set helps discover potentially predictive variables and relationships between them. * the test set assesses the strength of these relationships and improves them, shaping our model. * Finally the blind set validates the model. Here’s the iterative flow: SET UP A FLIGHTSTATS ACCOUNT We get our historical data from flightstats.com, so you’ll need to create an account to get access to their data sets. Save Time! If you don’t feel like walking through flightstats account setup. but want to understand the analytics, you can use a sample database I created. Skip ahead to the Create Spark Instance section to set up the app. 1. Sign up for a free developer account at FlightStats.com . 2. Fill out the form and monitor email for confirmation link (access to APIs may take up to 24 hours). 3. 
Once you get your access confirmation email, go to https://developer.flightstats.com/admin/applications and copy your Application ID and Application Key (you will need them in a few minutes). Tip: While you’re here, you can also explore the flightstats APIs: – https://developer.flightstats.com/api-docs-scheduledFlights/v1 – https://developer.flightstats.com/api-docs/airports/v1 CREATE A SPARK INSTANCE 1. Login to Bluemix (or sign up for a free trial) . 2. Create a new space. If you’ve been working in Bluemix already, create a new space to have a separate, clean working area for new apps and services. On the upper left of your Bluemix dashboard, click + Create a Space and name it flightpredict or whatever you want and click Create . 3. On your Bluemix dashboard, click Work with Data . Click New Service . Find and click Apache Spark then click Choose Apache Spark . Click Create . Click the New Instance button. DEPLOY SIMPLE DATA PIPE The Simple Data Pipe is a handy data movement tool our team created to help you get and combine JSON data for use where you need it. The fastest way to deploy this app to Bluemix is to click the Deploy to Bluemix button, which automatically provisions and binds the Cloudant service too. Using my sample credentials? In that case, you don’t need to import data with the pipe. Feel free to read and understand, but then skip ahead to: Create an IPython Notebook . If you would rather deploy manually , or have any issues, refer to the readme . When deployment is done, leave this Deployment Succeeded page open. You’ll return here in a minute. ADD INSIGHTS FOR WEATHER SERVICE To work its magic, the flight predict connector that we’re about to install needs weather data. So add IBM’s Insights for Weather service now, by following these steps: 1. Open a new browser window or tab, and in Bluemix, go to the top menu, and click Catalog . 2. In the Search box, type Weather , then click the Insights for Weather tile. 3. Under app , click the arrow and choose your new Simple Data Pipe application. Doing so binds the service to your new app. 4. In Selected plan choose Premium plan to ensure you’ll have enough authorized API calls to try out this app. Not ready to lay down your credit card? If you want to understand this tutorial, without stepping through all installations and data loads, you can follow along using our sample data. Just skip ahead to Create an IPython Notebook and run the notebook without changing any credentials. 5. Click Create . 6. If you’re prompted to restage your app, do so by clicking Restage . INSTALL FLIGHTSTATS CONNECTOR I created a custom connector for the Simple Data Pipe app that loads and combines historical flight data from flightstats.com with weather data from IBM Insights for Weather. Note: If you have a local copy of Simple Data Pipe, you can install this connector using Cloud Foundry . 1. In Bluemix, at the deployment succeeded screen, click the EDIT CODE button. 2. Click the package.json file to open it. 3. Edit the package.json file to add the following line to the dependencies list:""simple-data-pipe-connector-flightstats"": ""*"" Tip: Be sure to end the line above your new line with a comma and follow proper JSON syntax. 4. From the menu, choose File Save . 5. Press the Deploy app button and wait for the app to deploy again. LOAD THE DATA We’ll load 2 sets of data, an initial set of flight data from 10 major airports, and a test set, that the connector prepares for you. LOAD INITIAL DATA SET 1. 
Launch simple data pipe in one of the following ways: * In the code editor where your redeployed, go to the toolbar and click the Open button for your simple data pipe app. * Or, in Bluemix, go to the top menu and click Dashboard , then on your Simple Data Pipe app tile, click the Open URL button. 2. In Simple Data Pipe, go to menu on the left and click Create a New Pipe . 3. Click the Type dropdown list, and choose Flight Stats .When you added a Flightstats connector earlier, you added the option you’re choosing now. 4. In Name , enter training (or anything you want). 5. If you want, enter a Description . 6. Click Save and continue . 7. Enter the Flightstats App ID and App Key you copied when you set up your FlightStats account. 8. Click Connect to FlightStats . You see a You’re connected confirmation message. 9. Click Save and continue . 10. On the Filter Data screen, click the dropdown arrow and select Mega SubSet from 10 busiest airports . Then click Save and continue . 11. Click Skip , to bypass scheduling. 12. Click Run now . View your progress: If you want, you can see the data load in-process. In a separate browser tab or window, open or return to Bluemix. Open your Simple Data Pipe app, go the menu on the left, and click Logs . When the data’s done loading, you see a Pipe Run complete! message. LOAD TEST SET Create a new pipe again to load test data. 1. In your Simple Data Pipe app, click Create a new Pipe . 2. In the Type dropdown, select Flight Stats . 3. In Name enter test . 4. If you want, enter a Description . 5. Click Save and Continue . 6. Enter the Flightstats App ID and App Key you copied when you set up your FlightStats account. 7. Click Connect to FlightStats . You see a You’re connected confirmation message. 8. Click Save and continue . 9. On the Filter Data screen, click the dropdown arrow and select Test set . Then click Save and continue . CREATE AN IPYTHON NOTEBOOK Shortcuts: If you’ve opted to use my sample credentials, go through the following steps to create the notebook and run its commands. If you want to skip these notebook creation steps too, you can follow the rest of this tutorial by viewing this prebuilt notebook on Github: https://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats/blob/master/notebook/Flight%20Predict%20PyCon%202016.ipynb Create a notebook on Bluemix: 1. Go to your Bluemix dashboard and open your Spark service. 2. Click the Notebooks button. 3. Click the New Notebook button. 4. Click the From URL tab. 5. Name it whatever you want and enter the following in the Notebook URL field: https://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats/raw/master/notebook/Flight%20Predict%20PyCon%202016.ipynb INSTALL PYTHON PACKAGE AND ADD SERVICE CREDENTIALS Here, we install the Python Library I created, which lets you write code inline within notebook cells and encapsulate helper APIs within the Python package. This package helps keep our notebook short and performs most of the hard work. ( See this library on GitHub .) 1. 
Run the first cell of the notebook, which contains the following command: sc.addPyFile(""https://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats/raw/master/flightPredict/training.py"") sc.addPyFile(""https://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats/raw/master/flightPredict/run.py"") import training #module contains apis to train the models import run #module contains apis to run the models Tip: An alternative method to install the package (not recommended for use in this tutorial) is to use pip: !pip install --user --exists-action=w --egg git+https://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats.git#egg=flightPredict Compare these 2 ways of using helper Python packages – SparkContext.addPyFile . Easy addition of python module file, supports multiple module files via zip format, and recommended during development where frequent code changes occur. – egg distribution package: pip install from PyPi server or file server (like GitHub) . Persistent install across sessions, and recommended in production. ADD CREDENTIALS Before your new notebook can work with flight and weather data, it needs access. To grant it, add your Cloudant and Weather service credentials to the notebook. Using my sample credentials? Skip ahead to Step 4 and confirm that you see the following values: cloudantHost: dtaieb.cloudant.com cloudantUserName: weenesserliffircedinvers cloudantPassword: 72a5c4f939a9e2578698029d2bb041d775d088b5 weatherUrl: https://4b88408f-11e5-4ddc-91a6-fbd442e84879:p6hxeJsfIb@twcservice.mybluemix.net 1. In Bluemix, open your app’s dashboard. 2. In the menu on the left, click Environment Variables . 3. Copy credentials for Cloudant and Weather Insights. 4. Return to your notebook, and in the second cell, paste in your credentials, replacing the ones there. (If you’re just following along in the notebook, leave existing credentials in place.) 5. Run that cell to import python modules the notebook uses and to connect to services. TRAIN THE MACHINE LEARNING MODELS 1. Load training set in Spark SQL DataFrame. Within the next cell, make sure the training dbName is your dbname from Cloudant. (To find it, go to your Simple Data Pipe app dashboard, click the Cloudant tile, then click Launch . The Cloudant dashboard shows your dbname.) Then run the following code: dbName=""pycon_flightpredict_training_set"" %time cloudantdata = training.loadDataSet(dbName,""training"") %time cloudantdata.printSchema() %time cloudantdata.count() 2. Visualize classes in scatter plot.Run the next 3 cells to plot delays based on factors like temperature, pressure, and wind speed. These plots are good first step to check distribution and possibly identify patterns. 3. Load the training data as an RDD of LabeledPoint.Run the following code to Spark SQL connector to load data into a DataFrame. trainingData = training.loadLabeledDataRDD(""training"") trainingData.take(5) 4. Train multiple classification models. Here we apply several machine-learning classification algorithms. To ensure accuracy of our predictions, we test the following different methods, and use cross-validation to choose the best one. Run the next few cells to train: * Logistic Regression Mode * NaiveBayes Model * Decision Tree Model * Random Forest Model TEST THE MODELS 1. Load test dataMake sure your dbname is the test database name from Cloudant (check your Cloudant dashboard as you did in the preceding section). 
Then run the following code: dbTestName=""pycon_flightpredict_test_set"" testCloudantdata = training.loadDataSet(dbTestName,""test"") testCloudantdata.count() 2. Run Accuracy metricsRun the next cell to compare the performance of the models. 3. Run the next few cells to get confusion matrixes for each model. While the metrics table we just created can tell us which model performs well overall, the confusion matrixes let us see the performance of individual classes (like Delayed less than 2 hrs ) and help us decide if we need more training data or if we need to change classes or other variables. 4. Plot the distribution of your data with Histograms Run the code in cell 15 to refine classifications and see a bar chart. Each bar is a bin (group of data points). You can specify different numbers of bins to examine data distribution and identify outliers. This info, combined with the confusion matrix results, helps you quickly uncover issues with your data. Then you can fix them and create a better predictive model. If you see an extremely long tail here (lots of bins that yield few results), you may have a data distribution issue, which you could solve by tweaking your classes. For example, this graph prompted me to change Delayed more than 4 hours and Delayed less than 2 hours to shorter increments of: Delayed less than 13 minutes , Delayed between 13-41 minutes , and Delayed more than 41 minutes . Doing so improved accuracy and helped us include the most meaningful results in our model. 5. Customize the training handler. Run the cell beneath the bar chart to provide new classification and add day of departure as a new feature. This code also re-builds the models, re-computes accuracy metrics. RUN THE MODELS Now our predictive model is in place! Our app is working with enough accuracy to let flyers enter flight details and see the likelihood of a delay. Run the final cell. If you want, replace the flight details (in red) with info on an upcoming flight of yours and run it again to see if you’ll make it on time. CONCLUSION Predictive modeling is an art form and an intensely iterative process. It requires substantial data sets and a fast, flexible way to test and tweak approaches. Simple Data Pipe let us load the pertinent data into Cloudant. From there, we used IBM Analytics for Apache Spark to create a notebook for analysis and modeling. You saw how flexible a Python notebook can be. Using it in combination with APIs in my Python package let us leverage Spark MLLIB to train predictive models and cross-validate fast and effectively. Feel free to play with this code and extend it. For example, a great improvement for deploying this app in production, would be to create a custom card for Google Now that automatically notifies a mobile user of impending flight delays and then proposes alternative flight routes using Freebird. SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Click to share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. 
Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Geospatial * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM Privacy IBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.",Build a Machine Learning model with Apache Spark MLLib to predict flight delays based on weather data and past performance.,"Predict Flight Delays with Apache Spark MLLib, FlightStats, and Weather Data",Live,55 143,"INTRODUCING THE SIMPLE AUTOCOMPLETE SERVICE Glynn Bird / May 31, 2016We have all seen auto-complete on web forms. The field label is Town . We start typing “M” then “a” and before we know it, a pull-down list has appeared suggesting some words that begin with the letters we’ve typed: The more characters we type, the smaller the list gets and we can click on the correct town name at any time. Websites build such tools in one of two ways: 1. The entire data set is transferred to the web page and autocomplete happens within the browser 2. No data is transferred to the browser; each keypress triggers a search for matching items on a server-side API The first solution is best for small data sets, but when the list of possible values is larger (say hundreds, thousands, or even millions of options), then the client-server approach is much more efficient. It is this second scenario that the Simple Autocomplete Service is built to cover. WHAT IS THE SIMPLE AUTOCOMPLETE SERVICE? The Simple Autocomplete Service is a Node.js web app built with the Express framework that lets you upload multiple data sets to a cloud service which then operates a fast and efficient autocomplete API. Later in this article, we’dr is: it uses a Redis in-memory database to store and index the data. Here are some example API calls from a deployed Simple Autocomplete Service instance that has been locked down so that it is now read-only: * https://simple-autocomplete-service.mybluemix.net/api/countries?term=bo * https://simple-autocomplete-service.mybluemix.net/api/presidents?term=W Notice how the urls show the two individual data sets that have been uploaded ( countries and presidents ). The search string that you want to lookup is supplied as a term parameter. The API can then be plumbed into a webpage to provide auto-complete on a form control. You can run the application locally in conjunction with a local Redis instance or deploy to the IBM Bluemix platform-as-a-service with a connected Redis by Compose service. INSTALLATION Click this button to deploy the app to Bluemix, IBM’s cloud development platform. If you don’t yet have a Bluemix account, you’ll be prompted you to sign up. (You’ll also find this button in the Simple Autocomplete Service source code repository .) Upon deployment, you’ll get an error saying that deployment failed. No worries! It didn’t really. It just requires Redis. Click the APP DASHBOARD button and click your new Simple Autocomplete Service to open it. To add Redis: 1. In a new browser tab, head over to https://www.compose.io/ and sign up for an account there. 2. Hit Create Deployment then choose Redis and wait for a cluster to be created for you. 3. 
On the Getting Started page that appears, click reveal your password and leave this page open. You’ll come back for these Redis credentials in a moment. 4. Head back to Bluemix and where you have your Simple Autocomplete Service open. 5. Click ADD A SERVICE OR API and choose Redis by Compose . 6. Enter your credentials as follows: * For Username enter only x * In Password enter your Redis service password. * For Public hostname/Port enter the string that appears in the TCP Connection String box after the @ character, replacing the : character with / as illustrated: When you enter these credentials, your completed form looks something like this: 7. Click Create . 8. When prompted, click Restage . When the app is done staging, click its URL to launch and see the service in action. UPLOADING DATA Find or create a file with your own data. It should be a plain text file with one text string per line, like: William Mary John From the menu on the left, click Create an index , enter an Index name , and click the Upload button. Scroll up to Current Indexes and in a few seconds you see your new index in the list. Try a few auto-completes by typing letters in the Test box. You can add as many indexes as you need (or until you run out of Redis memory). The Simple Autocomplete Service is really an API service. You can try the API call directly in a new browser window, just visit the URL of this form: https://MYAPP.mybluemix.net/api/MYINDEX?term=a replacing MYAPP with your application domain and MYINDEX with the name you chose when you created the index. LOCKING DOWN THE SERVICE When you’re happy with your data, you can lock down the Simple Autocomplete Service so that it becomes a read-only API. Simply add a custom environment variable to your Bluemix app called “LOCKDOWN” with a value of “true”. Your application will restart and only the autocomplete API will function. INTEGRATING WITH YOUR OWN FORMS The Simple Autocomplete Service is CORS-enabled, so it should be simple to plumb it into your own forms. If you have an HTML page with jQuery and jQueryUI in, you can create an auto-complete form with a few lines of code:
THE ANATOMY OF THE SIMPLE AUTOCOMPLETE SERVICE FIRST PRINCIPLES Redis is chosen as the database for this task because it stores its data in memory (it is flushed to disk periodically). In-memory databases are extremely fast and the auto-complete use-case requires high performance because the use of the web form will expect a speedy reponse to the keypresses they make. The heart of our autocomplete service is the data that is uploaded. Any text file containing one line per value should be fine e.g. . . . Mabel Mabelle Mable Mada Madalena Madalyn Maddalena Maddi Maddie . . . One solution to find matches from this data is to store the values in a list and scan every member for matches when performing an autocomplete request. This solution is fine for small data sets but as it involves scanning the whole collection from top to bottom to establish a list of matches it becomes increasingly inefficient as the data size increases. It is said to have a O(N) complexity, because the effort required to perform the search increases linearly with the size of the data set (N). In a blog post from 2010 the creator of Redis, Salvatore Sanfilippo, discusses a more efficient solution which involves pre-calculating the possible search strings and placing them into an “sorted set” data structure in Redis. Sorted sets are usually used for ordering keys by value (e.g. a high-score table), but in this case it keeps our candidate search strings in alphabetical order. The solution outlined in the blog post is used in a slightly modified form in the Simple Autocomplete Service , with our sorted set containing keys made up of combinations of possible letter combinations: . . ""m"" ""ma"" ""mab"" ""mabe"" ""mabel*Mabel"" ""mabell"" ""mabelle*Mabelle"" ""mabl"" ""mable*Mable"" . . Some features of the data to notice: * this index occupies more space than a simple list of the complete values * the keys are stored in alphabetical order * the keys are lowercased and filtered for punctuation before saving for a predictable, case-sensitive match * at the end of each sequence of keys we store the unaltered original key using the notation mabelle*Mabelle , with the original unfiltered string placed after the asterisk. This allows the service to access the original string in its original case. * the keys are not repeated – there is only one key for “ma” despite several names starting with “ma” to save space in the index * the method of storage is most efficient on large data sets with lots of repetition at the starts of words IMPORTING THE DATA The Simple Autocomplete Service adds strings to the Redis database using the ZADD command to create a sorted set: ZADD myindex 0 ""m"" ZADD myindex 0 ""ma"" ZADD myindex 0 ""mab"" ZADD myindex 0 ""mabe"" ZADD myindex 0 ""mabel*Mabel"" The zero in the syntax above is the score of the sorted set. We set all the strings to have the same score so that only alphabetical ordering takes place. QUERYING THE DATA When we wish to find the auto-complete solutions for the string ma , we need to find our way to ma in our Redis index and then retrieve a number of keys that occur after that point in the index. In Redis, we use two queries to do this 1. ZRANK to find the place in the index that matches our search string 2. ZRANGE 75 to find the 75 lines that occur in the index from that point on. 75 is number hard-coded into the service to return a reasonable number of solutions to the query. e.g. 
ZRANK myindex ma (integer) 7429 ZRANGE myindex 7429 7504 1) ""ma"" 2) ""mab"" 3) ""mab*Mab"" 4) ""mabe"" 5) ""mabel"" 6) ""mabel*Mabel"" 7) ""mabell"" 8) ""mabelle*Mabelle"" 9) ""mabl"" 10) ""mable*Mable"" 11) ""mad"" 12) ""mada"" 13) ""mada*Mada"" 14) ""madal"" 15) ""madale"" 16) ""madalen"" 17) ""madalena*Madalena"" 18) ""madaly"" 19) ""madalyn*Madalyn"" 20) ""madd"" . . . The service only keeps the keys with an asterisk in the middle (the complete answers) and then returns those values to the user: [""Mab"",""Mabel"",""Mabelle"",""Mable""] As the index is stored in order by Redis, the ZRANK function is an O(log n) operation, meaning that its complexity only increases in proportion to the logarithm of the data size (N). The ZRANGE query is similarly efficient so the amount of work required to perform a search using the ZRANK/ZRANGE technique remains almost constant whatever the data size. How many strings would we need have in our text file before the ZRANK/ZRANGE solution out-performs scanning a linear list? The answer is less than 100. It’ the indexed solution wins in all but the very simplest cases. HOMEWORK As it stands, the Simple Autocomplete Service only matches strings that begin with search phrase. What if I wanted to match on the second word of a phrase? Imagine I indexed actors names: Molly Ringwald Judd Nelson Paul Gleason Anthony Michael Hall Ally Sheedy Emilio Estevez I want autocomplete to work when I type “A..L” as well as if I type “S..H”. That would involve indexing additional data a al all ally ally s ally sh ally she ally shee ally sheed ally sheedy*Ally Sheedy s sh shee sheed sheedy*Ally Sheedy The index would be bigger in this case, but it should work. If anyone would like to modify the source code repository and send me a pull request, I’d be happy to incorporate this as an option. SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Geospatial * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM Privacy IBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.",Easily add autocomplete to your web form fields. Simply upload your data set using this cloud service then use its fast and efficient autocomplete API.,Introducing the Simple Autocomplete Service,Live,56 147,"WILL WOLF DATA SCIENCE THINGS AND THOUGHTS ON THE WORLD * About * Archive * RSS * EN * ES TRANSFER LEARNING FOR FLIGHT DELAY PREDICTION VIA VARIATIONAL AUTOENCODERS WILL WOLF May 8, 2017In this work, we explore improving a vanilla regression model with knowledge learned elsewhere. As a motivating example, consider the task of predicting the number of checkins a given user will make at a given location. 
Our training data consist of checkins from 4 users across 4 locations in the week of May 1st, 2017 and looks as follows: user_id location checkins 1 a 3 1 b 6 2 c 7 2 d 2 3 a 1 3 c 4 4 b 9 4 d 4We'd like to predict how many checkins user 3 will make at location b in the coming week. How well will our model do? While each user_id might represent some unique behavior - e.g. user 3 sleeps late yet likes going out for dinner - and each location might represent its basic characteristics - e.g. location b is an open-late sushi bar - this is currently unbeknownst to our model. To this end, gathering this metadata and joining it to our training set is a clear option. If quality, thorough, explicit metadata are available, affordable and practical to acquire, this is likely the path to pursue. If not, we'll need to explore a more creative approach. How far can we get with implicit metadata learned from an external task? TRANSFER LEARNING ¶ Transfer learning allows us to use knowledge acquired in one task to improve performance in another. Suppose, for example, that we've been tasked with translating Portuguese to English and are given a basic phrasebook from which to learn. After a week, we take a lengthy test. A friend of ours - a fluent Spanish speaker who knows nothing of Portuguese - is tasked the same. Who gets a better score? PREDICTING FLIGHT DELAYS ¶ The goal of this work is to predict flight delays - a basic regression task. The data comprise 6,872,294 flights from 2008 via the United States Department of Transportation's Bureau of Transportation Statistics . I downloaded them from stat-computing.org . Each row consists of, among other things: DayOfWeek , DayofMonth , Month , ScheduledDepTimestamp (munged from CRSDepTime ), Origin , Dest and UniqueCarrier (airline), and well as CarrierDelay , WeatherDelay , NASDelay , SecurityDelay , LateAircraftDelay - all in minutes - which we will sum to create total_delay . We'll consider a random sample of 50,000 flights to make things easier. (For a more in-depth exploration of these data, please see this project's repository .) ROUTES, AIRPORTS ¶ While we can expect DayOfWeek , DayofMonth and Month to give some seasonal delay trends - delays are likely higher on Sundays or Christmas, for example - the Origin and Dest columns might suffer from the same pathology as user_id and location above: a rich behavioral indicator represented in a crude, ""isolated"" way. (A token in a bag-of-words model, as opposed to its respective word2vec representation, gives a clear analogy.) How can we infuse this behavioral knowledge into our original task? AN AUXILIARY TASK ¶ In 2015, I read a particularly-memorable blog post entitled Towards Anything2Vec by Allen Tran. Therein, Allen states: Like pretty much everyone, I'm obsessed with word embeddings word2vec or GloVe. Although most of machine learning in general is based on turning things into vectors, it got me thinking that we should probably be learning more fundamental representations for objects, rather than hand tuning features. Here is my attempt at turning random things into vectors, starting with graphs. In this post, Allen seeks to embed nodes - U.S. patents, incidentally - in a directed graph into vector space by predicting the inverse of the path-length to nodes nearby. 
To me, this (thus-far) epitomizes the ""data describe the individual better than they describe themself:"" while we could ask the nodes to self-classify into patents on ""computing,"" ""pharma,"" ""materials,"" etc., the connections between these nodes - formal citations, incidentally - will capture their ""true"" subject matters (and similarities therein) better than the authors ever could. Formal language, necessarily, generalizes. OpenFlights contains data for over ""10,000 airports, train stations and ferry terminals spanning the globe"" and the routes between. My goal is to train a neural network that, given an origin airport and its latitude and longitude, predicts the destination airport, latitude and longitude. This network will thereby ""encode"" each airport into a vector of arbitrary size containing rich information about, presumably, the diversity and geography of the destinations it services: its ""place"" in the global air network. Surely, a global hub like Heathrow - a fact presumably known to our neural network, yet unknown to our initial dataset with one-hot airport indices - has longer delays on Christmas than than a two-plane airstrip in Alaska. Crucially, we note that while our original (down-sampled) dataset contains delays amongst 298 unique airports, our auxiliary routes dataset comprises flights amongst 3186 unique airports. Notwithstanding, information about all airports in the latter is distilled into vector representations then injected into the former; even though we might not know about delays to/from Casablanca Mohammed V Airport (CMN), latent information about this airport will still be intrinsically considered when predicting delays between other airports to/from which CMN flies. DATA PREPARATION ¶ Our flight-delay design matrix $X$ will include the following columns: DayOfWeek , DayofMonth , Month , ScheduledDepTimestamp , Origin , Dest and UniqueCarrier . All columns will be one-hotted for simplicity. (Alternatively, I explored mapping each column to its respective value_counts() , i.e. X.loc[:, col] = X[col].map(col_val_counts) , which led to less agreeable convergence.) Let's get started. 
In [1]:fromabcimportABCMeta,abstractmethodfromIPython.displayimportIFrame,SVGimportosimportsysroot_dir=os.path.join(os.getcwd(),'..')sys.path.append(root_dir)importfeatherfromgmplotimportgmplotimportmatplotlib.pyplotaspltimportnumpyasnpimportpandasaspdimportseabornassnsfromsklearn.metricsimportmean_squared_errorasmean_squared_error_scikitfromsklearn.model_selectionimporttrain_test_splitfromsklearn.preprocessingimportMinMaxScaler,StandardScaler%matplotlib inline sns.set(style='darkgrid') In [2]:importkeras.backendasKfromkeras.layersimportBatchNormalization,Dense,Dropout,Embedding,Flatten,Input,LayerasKerasLayerfromkeras.layers.mergeimportconcatenate,dotfromkeras.lossesimportmean_squared_errorfromkeras.modelsimportModelfromkeras.optimizersimportAdamfromkeras.regularizersimportl2fromkeras.utils.vis_utilsimportmodel_to_dotfromkeras_tqdmimportTQDMNotebookCallback In [3]:FLIGHTS_PATH='../data/flights-2008-sample.feather'# build X, yflights=feather.read_dataframe(FLIGHTS_PATH)X=flights[['DayOfWeek','DayofMonth','Month','ScheduledDepTimestamp','Origin','Dest','UniqueCarrier']].copy()y=flights['total_delay'].copy()# one-hotone_hot_matrices=[]forcolinfilter(lambdacol:col!='ScheduledDepTimestamp',X.columns):one_hot_matrices.append(pd.get_dummies(X[col]))one_hot_matrix=np.concatenate(one_hot_matrices,axis=1)X=np.hstack([X['ScheduledDepTimestamp'].values.reshape(-1,1),one_hot_matrix])# normalizeX=StandardScaler().fit_transform(X)y=np.log(y+1).values In [4]:TEST_SIZE=int(X.shape[0]*.4)X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=TEST_SIZE,random_state=42)X_val,X_test,y_val,y_test=train_test_split(X_test,y_test,test_size=int(TEST_SIZE/2),random_state=42)print('Dataset sizes:')print(' Train: {}'.format(X_train.shape))print(' Validation: {}'.format(X_val.shape))print(' Test: {}'.format(X_test.shape)) Dataset sizes: Train: (30000, 657) Validation: (10000, 657) Test: (10000, 657) FLIGHT-DELAY MODELS ¶ Let's build two baseline models with the data we have. Both models have a single ReLU output and are trained to minimize the mean squared error of the predicted delay via stochastic gradient descent. ReLU was chosen as an output activation because delays are both bounded below at 0 and bi-modal. I considered three separate strategies for predicting this distribution. 1. Train a network with two outputs: total_delay and total_delay == 0 (Boolean). Optimize this network with a composite loss function: mean squared error and binary cross-entropy, respectively. 2. Train a ""poor-man's"" hierarchical model: a logistic regression to predict total_delay == 0 and a standard regression to predict total_delay . Then, compute the final prediction as a thresholded ternary, e.g. y_pred = np.where(y_pred_lr > threshhold, 0, y_pred_reg) . Train the regression model with both all observations, and just those where total_delay > 0 , and see which works best. 3. Train a single network with a ReLU activation. This gives a reasonably elegant way to clip our outputs below at 0, and mean-squared-error still tries to place our observations into the correct mode (of the bimodal output distribution; this said, mean-squared-error may try to ""play it safe"" and predict between the modes). I chose Option #3 because it performed best in brief experimentation and was the simplest to both fit and explain. 
In [5]:classBaseEmbeddingModel(metaclass=ABCMeta):defcompile(self,optimizer,loss,*args,**kwargs):self.model.compile(optimizer,loss)defsummary(self):returnself.model.summary()deffit(self,*args,**kwargs):returnself.model.fit(*args,**kwargs)defpredict(self,X):returnself.model.predict(X)@abstractmethoddef_build_model(self):passclassSimpleRegression(BaseEmbeddingModel):def__init__(self,input_dim:int,λ:float):'''Initializes the model parameters. Args: input_dim : The number of columns in our design matrix. λ : The regularization strength to apply to the model's dense layers. '''self.input_dim=input_dimself.λ=λself.model=self._build_model()def_build_model(self):input=Input((self.input_dim,),dtype='float32')dense=Dense(144,activation='relu',kernel_regularizer=l2(self.λ))(input)output=Dense(1,activation='relu',name='regression_output',kernel_regularizer=l2(self.λ))(dense)returnModel(input,output)classDeeperRegression(BaseEmbeddingModel):def__init__(self,input_dim:int,λ:float,dropout_p:float):'''Initializes the model parameters. Args: input_dim : The number of columns in our design matrix. λ : The regularization strength to apply to the model's dense layers. dropout_p : The percentage of units to drop in the model's dropout layer. '''self.input_dim=input_dimself.λ=λself.dropout_p=dropout_pself.model=self._build_model()def_build_model(self):input=Input((self.input_dim,),dtype='float32',name='input')dense=Dense(144,activation='relu',kernel_regularizer=l2(self.λ))(input)dense=Dense(144,activation='relu',kernel_regularizer=l2(self.λ))(dense)dense=Dropout(self.dropout_p)(dense)dense=Dense(72,activation='relu',kernel_regularizer=l2(self.λ))(dense)dense=Dense(16,activation='relu',kernel_regularizer=l2(self.λ))(dense)output=Dense(1,activation='relu',name='regression_output')(dense)returnModel(input,output) In [6]:deffit_flight_model(model,X_train,y_train,X_val,y_val,epochs,batch_size=256):returnmodel.fit(x=X_train,y=y_train,batch_size=batch_size,epochs=epochs,validation_data=(X_val,y_val),verbose=0,callbacks=[TQDMNotebookCallback(leave_outer=False)])defprepare_history_for_plot(history):'''Arrange the model's `history` into a ""long"" DataFrame to enable more convenient plotting. Args: history (keras.callbacks.History) : a Keras `history` object. '''results=pd.DataFrame({'train':history.history['loss'],'val':history.history['val_loss'],})results_long=pd.melt(results)results_long.columns=['dataset','loss']results_long['epoch']=2*history.epochresults_long['subject']=1returnresults_longdefplot_model_fit(history):'''Plot the training loss vs. the validation loss. Args: history (keras.callbacks.History) : a Keras `history` object. 
'''results=prepare_history_for_plot(history)plt.figure(figsize=(11,7))sns.tsplot(data=results,time='epoch',value='loss',condition='dataset',unit='subject')plt.title('Training Loss by Epoch',fontsize=13) SIMPLE REGRESSION ¶ In [7]:LEARNING_RATE=.0001simple_reg=SimpleRegression(input_dim=X.shape[1],λ=.05)simple_reg.compile(optimizer=Adam(lr=LEARNING_RATE),loss='mean_squared_error')simple_reg_fit=fit_flight_model(simple_reg,X_train,y_train,X_val,y_val,epochs=5,batch_size=16)plot_model_fit(simple_reg_fit) DEEPER REGRESSION ¶ In [8]:deeper_reg=DeeperRegression(input_dim=X.shape[1],λ=.03,dropout_p=.2)deeper_reg.compile(optimizer=Adam(lr=.0001),loss='mean_squared_error')deeper_reg_fit=fit_flight_model(deeper_reg,X_train,y_train,X_val,y_val,epochs=5,batch_size=16)plot_model_fit(deeper_reg_fit) TEST SET PREDICTIONS ¶ In [9]:y_pred_simple=simple_reg.model.predict(X_test).ravel()y_pred_deeper=deeper_reg.model.predict(X_test).ravel()mse_simple=mean_squared_error_scikit(y_test,y_pred_simple)mse_deeper=mean_squared_error_scikit(y_test,y_pred_deeper)print('Mean squared error, simple regression: {}'.format(mse_simple))print('Mean squared error, deeper regression: {}'.format(mse_deeper)) Mean squared error, simple regression: 2.331459019628268 Mean squared error, deeper regression: 2.3186310632259204 LEARNING AIRPORT EMBEDDINGS ¶ We propose two networks through which to learn airport embeddings: a dot product siamese network, and a variational autoencoder . DOT PRODUCT SIAMESE NETWORK ¶ This network takes as input origin and destination IDs, latitudes and longitudes. It gives as output a binary value indicating whether or not a flight-route between these airports exists. The airports DataFrame gives the geographic metadata. The routes DataFrame gives positive training examples for our network. To build negative samples, we employ, delightfully, ""negative sampling."" NEGATIVE SAMPLING ¶ routes gives exlusively (origin, dest, exists = 1) triplets. To create triplets where exists = 0 , we simply build them ourself: (origin, fake_dest, exists = 0) . It's that simple. Inspired by word2vec's approach to an almost identical problem, I pick fake_dest 's based on the frequency with which they occur in the dataset - more frequent samples being more likely to be selected - via: $$P(a_i) = \frac{ {f(a_i)}^{3/4} }{\sum_{j=0}^{n}\left( {f(a_j)}^{3/4} \right) }$$where $a_i$ is an airport. To choose a fake_dest for a given origin , we first remove all of the real dest 's, re-normalize $P(a)$, then take a multinomial draw. For a more complete yet equally approachable explanation, please see Goldberg and Levy . For an extremely thorough review of related methods, see Sebastian Ruder's On word embeddings - Part 2: Approximating the Softmax . VARIATIONAL AUTOENCODER ¶ DISCRIMINATIVE MODELS ¶ The previous network is a discriminative model: given two inputs origin and dest , it outputs the conditional probability that exists = 1 . While discriminative models are effective in distinguishing between output classes, they don't offer an idea of what data look like within each class itself. To see why, let's restate Bayes rule for a given input $x$: $$P(Y\vert x) = \frac{P(x\vert Y)P(Y)}{P(x)} = \frac{P(x, Y)}{P(x)}$$Discriminative classifiers jump directly to estimating $P(Y\vert x)$ without modeling its component parts $P(x, Y)$ and $P(x)$. 
Instead, as the intermediate step, they simply compute an unnormalized joint distribution $\tilde{P}(x, Y)$ and a normalizing ""partition function."" The following then gives the model's predictions for the same reason that $\frac{.2}{1} = \frac{3}{15}$: $$P(Y\vert x) = \frac{P(x, Y)}{P(x)} = \frac{\tilde{P}(x, Y)}{\text{partition function}}$$This is explained much more thoroughly in a previous blog post: Deriving the Softmax from First Principles . GENERATIVE MODELS ¶ Conversely, a variational autoencoder is a generative model: instead of jumping directly to the conditional probability of all possible outputs given a specific input, they first compute the true component parts: the joint probability distribution over data and inputs alike, $P(X, Y)$, and the distribution over our data, $P(X)$. The joint probability can be rewritten as $P(X, Y) = P(Y)P(X\vert Y)$: as such, generative models tell us the distribution over classes in our dataset, as well as the distribution of inputs within each class. Suppose we are trying to predict t-shirt colors with a 3-feature input; generative models would tell us: ""30% of your t-shirts are green - typically produced by inputs near x = [1, 2, 3] ; 40% are red - typically produced by inputs near x = [10, 20, 30] ; 30% are blue - typically produced by inputs near x = [100, 200, 300] . This is in contrast to a discriminative model which would simply compute: given an input $x$, your output probabilities are: $\{\text{red}: .2, \text{green}: .3, \text{blue}: .5\}$. To generate new data with a generative model, we draw from $P(Y)$, then $P(X\vert Y)$. To make predictions, we solicit $P(Y), P(x\vert Y)$ and $P(x)$ and employ Bayes rule outright. MANIFOLD ASSUMPTION ¶ The goal of both autoencoders is to discover underlying ""structure"" in our data: while each airport can be one-hot encoded into a 3186-dimensional vector, we wish to learn a, or even the, reduced space in which our data both live and vary. This concept is well understood through the ""manifold assumption,"" explained succinctly in this CrossValidated thread : Imagine that you have a bunch of seeds fastened on a glass plate, which is resting horizontally on a table. Because of the way we typically think about space, it would be safe to say that these seeds live in a two-dimensional space, more or less, because each seed can be identified by the two numbers that give that seed's coordinates on the surface of the glass. Now imagine that you take the plate and tilt it diagonally upwards, so that the surface of the glass is no longer horizontal with respect to the ground. Now, if you wanted to locate one of the seeds, you have a couple of options. If you decide to ignore the glass, then each seed would appear to be floating in the three-dimensional space above the table, and so you'd need to describe each seed's location using three numbers, one for each spatial direction. But just by tilting the glass, you haven't changed the fact that the seeds still live on a two-dimensional surface. So you could describe how the surface of the glass lies in three-dimensional space, and then you could describe the locations of the seeds on the glass using your original two dimensions. In this thought experiment, the glass surface is akin to a low-dimensional manifold that exists in a higher-dimensional space : no matter how you rotate the plate in three dimensions, the seeds still live along the surface of a two-dimensional plane. 
In other words, the full spectrum of that which characterizes an airport can be described by just a few numbers. Varying one of these numbers - making it larger or smaller - would result in an airport of slightly different ""character;"" if one dimension were to represent ""global travel hub""-ness, a value of $-1000$ along this dimension might give us that hangar in Alaska. In the context of autoencoders (and dimensionality reduction algorithms), ""learning 'structure' in our data"" means nothing more than finding that ceramic plate amidst a galaxy of stars . GRAPHICAL MODELS ¶ Variational autoencoders do not have the same notion of an ""output"" - namely, ""does a route between two airports exist?"" - as our dot product siamese network. To detail this model, we'll start near first principles with probabilistic graphical models with our notion of the ceramic plate in mind: Coordinates on the plate detail airport character; choosing coordinates - say, [global_hub_ness = 500, is_in_asia = 500] - allows us to generate an airport. In this case, it might be Seoul. In variational autoencoders, ceramic-plate coordinates are called the ""latent vector,"" denoted $z$. The joint probability of our graphical model is given as: $$P(z)P(x\vert z) = P(z, x)$$Our goal is to infer the priors that likely generated these data via Bayes rule: $$P(z\vert x) = \frac{P(z)P(x\vert z)}{P(x)}$$The denominator is called the evidence ; we obtain it by marginalizing the joint distribution over the latent variables: $$P(x) = \int P(x\vert z)P(z)dz$$Unfortunately, this asks us to consider all possible configurations of the latent vector $z$. Should $z$ exist on the vertices of a cube in $\mathbb{R}^3$, this would not be very difficult; should $z$ be a continuous-valued vector in $\mathbb{R}^{10}$, this becomes a whole lot harder. Computing $P(x)$ is problematic. VARIATIONAL INFERENCE ¶ In fact, we could attempt to use MCMC to compute $P(z\vert x)$; however, this is slow to converge. Instead, let's compute an approximation to this distribution then try to make it closely resemble the (intractable) original. In this vein, we introduce variational inference , which ""allows us to re-write statistical inference problems (i.e. infer the value of a random variable given the value of another random variable) as optimization problems (i.e. find the parameter values that minimize some objective function)."" 1 Let's choose our approximating distribution as simple, parametric and one we know well: the Normal (Gaussian) distribution. Were we able to compute $P(z\vert x) = \frac{P(x, z)}{P(x)}$, it is instrinsic that $z$ is contingent on $x$; when building our own distribution to approximate $P(z\vert x)$, we need to be explicit about this contingency: different values for $x$ should be assumed to have been generated by different values of $z$. 
Let's write our approximation as follows, where $\lambda$ parameterizes the Gaussian for a given $x$: $$q_{\lambda}(z\vert x)$$Finally, as stated previously, we want to make this approximation closely resemble the original; the KL divergence quantifies their difference: $$KL(q_{\lambda}(z\vert x)\Vert P(z\vert x)) = \int{q_{\lambda}(z\vert x)\log\frac{q_{\lambda}(z\vert x)}{P(z\vert x)}dz}$$Our goal is to obtain the argmin with respect to $\lambda$: $$q_{\lambda}^{*}(z\vert x) = \underset{\lambda}{\arg\min}\ \text{KL}(q_{\lambda}(z\vert x)\Vert P(z\vert x))$$Expanding the divergence, we obtain: $$ \begin{align*} KL(q_{\lambda}(z\vert x)\Vert P(z\vert x)) &= \int{q_{\lambda}(z\vert x)\log\frac{q_{\lambda}(z\vert x)}{P(z\vert x)}dz}\\ &= \int{q_{\lambda}(z\vert x)\log\frac{q_{\lambda}(z\vert x)P(x)}{P(z, x)}dz}\\ &= \int{q_{\lambda}(z\vert x)\bigg(\log{q_{\lambda}(z\vert x) -\log{P(z, x)} + \log{P(x)}}\bigg)dz}\\ &= \int{q_{\lambda}(z\vert x)\bigg(\log{q_{\lambda}(z\vert x)} -\log{P(z, x)}}\bigg)dz + \log{P(x)}\int{q_{\lambda}(z\vert x)dz}\\ &= \int{q_{\lambda}(z\vert x)\bigg(\log{q_{\lambda}(z\vert x)} -\log{P(z, x)}}\bigg)dz + \log{P(x)} \cdot 1 \end{align*} $$As such, since only the left term depends on $\lambda$, minimizing the entire expression with respect to $\lambda$ amounts to minimizing this term. Incidentally, the opposite (negative) of this term is called the ELBO , or the ""evidence lower bound."" To see why, let's plug the ELBO into the equation above and solve for $\log{P(x)}$: $$\log{P(x)} = ELBO(\lambda) + KL(q_{\lambda}(z\vert x)\Vert P(z\vert x))$$In English: ""the log of the evidence is at least the lower bound of the evidence plus the divergence between our true posterior $P(z\vert x)$ and our (variational) approximation to this posterior $q_{\lambda}(z\vert x)$."" Since the left term above is the opposite of the ELBO, minimizing this term is equivalent to maximizing the ELBO. Let's restate the equation and rearrange further: $$ \begin{align*} ELBO(\lambda) &= -\int{q_{\lambda}(z\vert x)\bigg(\log{q_{\lambda}(z\vert x)} -\log{P(z, x)}}\bigg)dz\\ &= -\int{q_{\lambda}(z\vert x)\bigg(\log{q_{\lambda}(z\vert x)} -\log{P(x\vert z)} - \log{P(z)}}\bigg)dz\\ &= -\int{q_{\lambda}(z\vert x)\bigg(\log{q_{\lambda}(z\vert x)} - \log{P(z)}}\bigg)dz + \log{P(x\vert z)}\int{q_{\lambda}(z\vert x)dz}\\ &= -\int{q_{\lambda}(z\vert x)\log{\frac{q_{\lambda}(z\vert x)}{P(z)}}dz} + \log{P(x\vert z)} \cdot 1\\ &= \log{P(x\vert z)} -KL(q_{\lambda}(z\vert x)\Vert P(z)) \end{align*} $$Our goal is to maximize this expression, or minimize the opposite: $$-\log{P(x\vert z)} + KL(q_{\lambda}(z\vert x)\Vert P(z))$$In machine learning parlance: ""minimize the negative log likelihood of our data (generated via $z$) plus the divergence between the distribution (ceramic plate) of $z$ and our approximation thereof."" See what we did? FINALLY, BACK TO NEURAL NETS ¶ The variational autoencoder consists of an encoder network and a decoder network. ENCODER ¶ The encoder network takes as input $x$ (an airport) and produces as output $z$ (the latent ""code"" of that airport, i.e. its location on the ceramic plate). As an intermediate step, it produces multivariate Gaussian parameters $(\mu_{x_i}, \sigma_{x_i})$ for each airport. These parameters are then plugged into a Gaussian $q$, from which we sample a value $z$. The encoder is parameterized by a weight matrix $\theta$. DECODER ¶ The decoder network takes as input $z$ and produces $P(x\vert z)$: a reconstruction of the airport vector (hence, autoencoder). 
It is parameterized by a weight matrix $\phi$. LOSS FUNCTION ¶ The network's loss function is the sum of the mean squared reconstruction error of the original input $x$ and the KL divergence between the true distribution of $z$ and its approximation $q$. Given the reparameterization trick (next section) and another healthy scoop of algebra, we write this in Python code as follows: '''`z_mean` gives the mean of the Gaussian that generates `z``z_log_var` gives the log-variance of the Gaussian that generates `z``z` is generated via: z = z_mean + K.exp(z_log_var / 2) * epsilon = z_mean + K.exp( log(z_std)**2 / 2 ) * epsilon = z_mean + K.exp( (2 * log(z_std) / 2 ) * epsilon = z_mean + K.exp( log(z_std) ) * epsilon = z_mean + z_std * epsilon'''kl_loss_numerator=1+z_log_var-K.square(z_mean)-K.exp(z_log_var)kl_loss=-0.5*K.sum(kl_loss_numerator,axis=-1)defloss(x,x_decoded):returnmean_squared_error(x,x_decoded)+kl_loss REPARAMETERIZATION TRICK ¶ When back-propagating the network's loss to $\theta$ , we need to go through $z$ — a sample taken from $q_{\theta}(z\vert x)$. Trivially, this sample is a scalar; intuitively, its derivative should be non-zero. In solution, we'd like the sample to depend not on the stochasticity of the random variable, but on the random variable's parameters . To this end, we employ the ""reparametrization trick"" , such that the sample depends on these parameters deterministically . As a quick example, this trick allows us to write $\mathcal{N}(\mu, \sigma)$ as $z = \mu + \sigma \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, 1)$. Drawing samples this way allows us to propagate error backwards through our network. AUXILIARY DATA ¶ ROUTES ¶ In [10]:# import routesroutes_cols=['airline','airline_id','origin','origin_id','dest','dest_id','codeshare','stops','equipment']routes=pd.read_csv('../data/routes.csv',names=routes_cols,usecols=['origin','dest'])routes['exists']=1# how many unique routes are there? by how many airlines are they flown?unique_routes=routes.groupby(['origin','dest']).count()print('There are {} unique routes.'.format(unique_routes.shape[0]))unique_routes.sort_values(by='exists',ascending=False).head(20).T There are 37595 unique routes. 
Out[10]: origin ORD ATL ORD HKT HKG CAN DOH ATL AUH BKK JFK MIA LHR ATL KGL MSY MCT CNX CDG dest ATL ORD MSY BKK BKK HGH BAH MIA MCT HKG LHR ATL JFK LAX DFW EBB JFK AUH BKK JFK exists 20 19 13 13 12 12 12 12 12 12 12 12 12 11 11 11 11 11 11 11 In [11]:# compute airport frequencies for negative samplingall_airports=routes['origin'].tolist()+routes['dest'].tolist()airport_counts=pd.Series(all_airports).value_counts()airport_probs=(airport_counts**.75)/(airport_counts**.75).sum() In [12]:defcompute_unique_negative_dests(airport,routes=routes):returnroutes[routes['origin']!=airport]['dest'].unique()defdraw_negative_samples(n,neg_dest_probs):samples_mode=np.infwhilesamples_mode>=.75*n:negative_sample_idxs=np.random.multinomial(n,neg_dest_probs)samples_mode=negative_sample_idxs.max()ifn>=4else-np.infnegative_samples=[]fordest,countinzip(neg_dest_probs.index,negative_sample_idxs):ifcount>0:negative_samples+=count*[dest]returnnegative_samples In [13]:# append `routes` with negative samplesnegative_sample_dfs=[]fori,airportinenumerate(set(routes['origin'])):n_routes=len(routes[routes['origin']==airport])negative_dests=compute_unique_negative_dests(airport)negative_dest_probs=airport_probs[negative_dests]/airport_probs[negative_dests].sum()negative_samples=draw_negative_samples(n_routes,negative_dest_probs)df=pd.DataFrame({'origin':airport,'dest':negative_samples,'exists':0})negative_sample_dfs.append(df)negative_routes=pd.concat(negative_sample_dfs,axis=0)routes=pd.concat([routes,negative_routes]) AIRPORTS ¶ In [14]:# import airportsairports_cols=['Airport ID','Name','City','Country','IATA','ICAO','Latitude','Longitude','Altitude','Timezone','DST','Tz database time zone','Type','Source']airports=pd.read_csv('../data/airports.csv',names=airports_cols,usecols=['Name','IATA','Latitude','Longitude','Altitude'],index_col=['IATA'])# join origin and destination airport metadata to `routes`origin_airports=airports.copy()origin_airports.columns=['origin_name','origin_latitude','origin_longitude','origin_altitude']dest_airports=airports.copy()dest_airports.columns=['dest_name','dest_latitude','dest_longitude','dest_altitude']routes=routes\ .join(origin_airports,on='origin')\ .join(dest_airports,on='dest')\ .dropna()\ .reset_index(drop=True)# map airport names to a unique indexdelall_airportsall_airports=routes['origin'].tolist()+routes['dest'].tolist()unique_airports=set(all_airports)airport_to_id={airport:indexforindex,airportinenumerate(unique_airports)}routes['origin_id']=routes['origin'].map(airport_to_id)routes['dest_id']=routes['dest'].map(airport_to_id) In [15]:# build X_routes, y_routesgeo_cols=['origin_latitude','origin_longitude','dest_latitude','dest_longitude']X_r=routes[['origin_id','dest_id']+geo_cols].copy()y_r=routes['exists'].copy()X_r.loc[:,geo_cols]=StandardScaler().fit_transform(X_r[geo_cols])# split training, test datatest_size=X_r.shape[0]//3val_size=test_size//2X_train_r,X_test_r,y_train_r,y_test_r=train_test_split(X_r,y_r,test_size=test_size,random_state=42)X_val_r,X_test_r,y_val_r,y_test_r=train_test_split(X_test_r,y_test_r,test_size=val_size,random_state=42)print('Dataset sizes:')print(' Train: {}'.format(X_train_r.shape))print(' Validation: {}'.format(X_val_r.shape))print(' Test: {}'.format(X_test_r.shape)) Dataset sizes: Train: (87630, 6) Validation: (21907, 6) Test: (21907, 6) DOT PRODUCT EMBEDDING MODEL ¶ To start, let's train our model with a single latent dimension then visualize the results on the world map. 
In [16]:N_UNIQUE_AIRPORTS=len(unique_airports)classDotProductEmbeddingModel(BaseEmbeddingModel):def__init__(self,embedding_size:int,λ:float,n_unique_airports=N_UNIQUE_AIRPORTS):'''Initializes the model parameters. Args: embedding_size : The desired number of latent dimensions in our embedding space. λ : The regularization strength to apply to the model's dense layers. '''self.n_unique_airports=n_unique_airportsself.embedding_size=embedding_sizeself.λ=λself.model=self._build_model()def_build_model(self):# inputsorigin=Input(shape=(1,),name='origin')dest=Input(shape=(1,),name='dest')origin_geo=Input(shape=(2,),name='origin_geo')dest_geo=Input(shape=(2,),name='dest_geo')# embeddingsorigin_embedding=Embedding(self.n_unique_airports,output_dim=self.embedding_size,embeddings_regularizer=l2(self.λ),name='origin_embedding')(origin)dest_embedding=Embedding(self.n_unique_airports,output_dim=self.embedding_size,embeddings_regularizer=l2(self.λ))(dest)# dot productdot_product=dot([origin_embedding,dest_embedding],axes=2)dot_product=Flatten()(dot_product)dot_product=concatenate([dot_product,origin_geo,dest_geo],axis=1)# dense layerstanh=Dense(10,activation='tanh')(dot_product)tanh=BatchNormalization()(tanh)# outputexists=Dense(1,activation='sigmoid')(tanh)returnModel(inputs=[origin,dest,origin_geo,dest_geo],outputs=[exists]) In [17]:dp_model=DotProductEmbeddingModel(embedding_size=1,λ=.0001)dp_model.compile(optimizer=Adam(lr=.001),loss='binary_crossentropy')SVG(model_to_dot(dp_model.model).create(prog='dot',format='svg')) Out[17]: G 5275051792 origin: InputLayer 4974217368 origin_embedding: Embedding 5275051792->4974217368 5275052128 dest: InputLayer 5274577440 embedding_1: Embedding 5275052128->5274577440 5274577552 dot_1: Dot 4974217368->5274577552 5274577440->5274577552 5274449512 flatten_1: Flatten 5274577552->5274449512 5273038296 concatenate_1: Concatenate 5274449512->5273038296 5275052464 origin_geo: InputLayer 5275052464->5273038296 5275052744 dest_geo: InputLayer 5275052744->5273038296 5272987184 dense_6: Dense 5273038296->5272987184 5272987968 batch_normalization_1: BatchNormalization 5272987184->5272987968 5272758368 dense_7: Dense 5272987968->5272758368 In [18]:dp_model_fit=dp_model.fit(x=[X_train_r['origin_id'],X_train_r['dest_id'],X_train_r[['origin_latitude','origin_longitude']].as_matrix(),X_train_r[['dest_latitude','dest_longitude']].as_matrix(),],y=y_train_r,batch_size=256,epochs=10,validation_data=([X_val_r['origin_id'],X_val_r['dest_id'],X_val_r[['origin_latitude','origin_longitude']].as_matrix(),X_val_r[['dest_latitude','dest_longitude']].as_matrix()],y_val_r),verbose=0,callbacks=[TQDMNotebookCallback(leave_outer=False)])plot_model_fit(dp_model_fit) VISUALIZE EMBEDDINGS ¶ To visualize results, we'll: 1. Compose a list of unique origin airports. 2. Extract the learned (1-dimensional) embedding for each. 3. Scale the results to $[0, 1]$. 4. Use the scaled embedding as a percentile-index into a color gradient. Here, we've chosen the colors of the rainbow: low values are blue/purple, and high values are orange/red. 
In [19]:# compose DataFrame of unique originssubset_cols=['origin_id','origin','origin_latitude','origin_longitude']unique_origins=routes.drop_duplicates(subset=subset_cols).reset_index(drop=True)unique_origins=unique_origins[subset_cols]unique_origins.columns=['origin_id','origin','latitude','longitude']defget_dp_embeddings(dp_model,unique_origins=unique_origins):'''Returns the origin airport embeddings from the dot-product embedding model, *aligned with the index of `unique_origins`*. '''origin_embeddings=dp_model.model.get_layer(name='origin_embedding').get_weights()[0]returnorigin_embeddings[unique_origins['origin_id'].values]unique_origins['embedding']=get_dp_embeddings(dp_model) In [20]:MARKERS=['#9400D3','#BA55D3','#1E90FF','#9ACD32','#FFFF00','#FFA500','#FF6347']defprepare_colors(intensities:pd.Series,palette=MARKERS):'''Indexes scale-less color intensities into HTML color codes with respect to a given palette. Args: intensities : A Pandas Series containing values with which to scale the color palette. These values do not need to be scaled to a specific interval. palette : A list of HTML color codes (loosely) spanning the principal colors of the rainbow. Returns: list : HTML color codes. '''palette_matrix=np.array(palette)intensities=intensities.values.reshape(-1,1)percentiles=MinMaxScaler().fit_transform(intensities).ravel()percentiles-=1e-5get_percentile_marker=lambdaperc:palette[int(perc*len(palette))]returnpd.Series(percentiles)\ .map(get_percentile_marker)\ .tolist()WORLD_COORDS=[21.2770321,5.0159425,3]defplot_embeddings_on_world_map(unique_origins_df:pd.DataFrame,output_path:str,world_coords=WORLD_COORDS):'''Plots each unique origin airport on the world map, colored by its 1-dimensional network embedding. Darker colors indicate a larger embedding value. Args: unique_origins_df : A Pandas DataFrame containing at least the following columns: `latitude`, `longitude`, `embedding`. output_path : The path to which to write the HTML map file. world_coords : Respectively, the latitude, longitude, and 'zoom factor' appended to the Google Maps query string so as to focus the map on the entire world. Returns: None : Instead, writes an HTML map file to `output_path`. '''unique_origins_df['color']=prepare_colors(unique_origins_df['embedding'])gmap=gmplot.GoogleMapPlotter(*world_coords,markers_base_path='',api_key=os.environ['GOOGLE_MAPS_API_KEY'])gmap.scatter(lats=unique_origins_df['latitude'].tolist(),lngs=unique_origins_df['longitude'].tolist(),color=unique_origins_df['color'].tolist(),marker=True)gmap.draw(output_path) In [21]:plot_embeddings_on_world_map(unique_origins,output_path='../figures/dp_model_map.html') In [38]:# visit the URL for a full-screen view 👇DOT_PRODUCT_EMBED_VIZ_S3_PATH='https://willwolf-public.s3.amazonaws.com/transfer-learning-flight-delays/dp_model_map.html'IFrame(DOT_PRODUCT_EMBED_VIZ_S3_PATH,width=1000,height=800) Out[38]:VARIATIONAL AUTOENCODER ¶ In [23]:classVariationalLayer(KerasLayer):def__init__(self,output_dim:int,epsilon_std=1.):'''A custom ""variational"" Keras layer that completes the variational autoencoder. Args: output_dim : The desired number of latent dimensions in our embedding space. 
'''self.output_dim=output_dimself.epsilon_std=epsilon_stdsuper().__init__()defbuild(self,input_shape):self.z_mean_weights=self.add_weight(shape=(input_shape[1],self.output_dim),initializer='glorot_normal',trainable=True)self.z_mean_bias=self.add_weight(shape=(self.output_dim,),initializer='zero',trainable=True,)self.z_log_var_weights=self.add_weight(shape=(input_shape[1],self.output_dim),initializer='glorot_normal',trainable=True)self.z_log_var_bias=self.add_weight(shape=(self.output_dim,),initializer='zero',trainable=True)super().build(input_shape)defcall(self,x):z_mean=K.dot(x,self.z_mean_weights)+self.z_mean_biasz_log_var=K.dot(x,self.z_log_var_weights)+self.z_log_var_biasepsilon=K.random_normal(shape=K.shape(z_log_var),mean=0.,stddev=self.epsilon_std)kl_loss_numerator=1+z_log_var-K.square(z_mean)-K.exp(z_log_var)self.kl_loss=-0.5*K.sum(kl_loss_numerator,axis=-1)returnz_mean+K.exp(z_log_var/2)*epsilondefloss(self,x,x_decoded):returnmean_squared_error(x,x_decoded)+self.kl_lossdefcompute_output_shape(self,input_shape):return(input_shape[0],self.output_dim) In [24]:classVariationalAutoEncoderEmbeddingModel(BaseEmbeddingModel):def__init__(self,embedding_size:int,dense_layer_size:int,λ:float,n_unique_airports=N_UNIQUE_AIRPORTS):'''Initializes the model parameters. Args: embedding_size : The desired number of latent dimensions in our embedding space. λ : The regularization strength to apply to the model's dense layers. '''self.embedding_size=embedding_sizeself.dense_layer_size=dense_layer_sizeself.λ=λself.n_unique_airports=n_unique_airportsself.variational_layer=VariationalLayer(embedding_size)self.model=self._build_model()def_build_model(self):# encoderorigin=Input(shape=(self.n_unique_airports,),name='origin')origin_geo=Input(shape=(2,),name='origin_geo')dense=concatenate([origin,origin_geo])dense=Dense(self.dense_layer_size,activation='tanh',kernel_regularizer=l2(self.λ))(dense)dense=BatchNormalization()(dense)variational_output=self.variational_layer(dense)encoder=Model([origin,origin_geo],variational_output,name='encoder')# decoderlatent_vars=Input(shape=(self.embedding_size,))dense=Dense(self.dense_layer_size,activation='tanh',kernel_regularizer=l2(self.λ))(latent_vars)dense=Dense(self.dense_layer_size,activation='tanh',kernel_regularizer=l2(self.λ))(dense)dense=BatchNormalization()(dense)dest=Dense(self.n_unique_airports,activation='softmax',name='dest',kernel_regularizer=l2(self.λ))(dense)dest_geo=Dense(2,activation='linear',name='dest_geo')(dense)decoder=Model(latent_vars,[dest,dest_geo],name='decoder')# end-to-endencoder_decoder=Model([origin,origin_geo],decoder(encoder([origin,origin_geo])))returnencoder_decoder In [25]:vae_model=VariationalAutoEncoderEmbeddingModel(embedding_size=1,dense_layer_size=20,λ=.003)vae_model.compile(optimizer=Adam(lr=LEARNING_RATE),loss=[vae_model.variational_layer.loss,'mean_squared_logarithmic_error'],loss_weights=[1.,.2])SVG(model_to_dot(vae_model.model).create(prog='dot',format='svg')) Out[25]: G 5274278880 origin: InputLayer 5304745880 encoder: Model 5274278880->5304745880 5274277200 origin_geo: InputLayer 5274277200->5304745880 5330847224 decoder: Model 5304745880->5330847224 In [26]:# build VAE training, test 
setsone_hot_airports=np.eye(N_UNIQUE_AIRPORTS)X_train_r_origin=one_hot_airports[X_train_r['origin_id']]X_val_r_origin=one_hot_airports[X_val_r['origin_id']]X_test_r_origin=one_hot_airports[X_test_r['origin_id']]X_train_r_dest=one_hot_airports[X_train_r['dest_id']]X_val_r_dest=one_hot_airports[X_val_r['dest_id']]X_test_r_dest=one_hot_airports[X_test_r['dest_id']]print('Dataset sizes:')print(' Train: {}'.format(X_train_r_origin.shape))print(' Validation: {}'.format(X_val_r_origin.shape))print(' Test: {}'.format(X_test_r_origin.shape)) Dataset sizes: Train: (87630, 3186) Validation: (21907, 3186) Test: (21907, 3186) In [27]:vae_model_fit=vae_model.fit(x=[X_train_r_origin,X_train_r[['origin_latitude','origin_longitude']].as_matrix()],y=[X_train_r_dest,X_train_r[['dest_latitude','dest_longitude']].as_matrix()],batch_size=1024,epochs=5,validation_data=([X_val_r_origin,X_val_r[['origin_latitude','origin_longitude']].as_matrix()],[X_val_r_dest,X_val_r[['dest_latitude','dest_longitude']].as_matrix()],),verbose=0,callbacks=[TQDMNotebookCallback(leave_outer=False)])plot_model_fit(vae_model_fit) VISUALIZE ¶ In [28]:defget_vae_embeddings(vae_model,unique_origins=unique_origins):'''Returns the origin airport embeddings from the variational autoencoder embedding model, *aligned with the index of `unique_origins`*. '''encoder_inputs=[one_hot_airports[unique_origins['origin_id']],unique_origins[['latitude','longitude']].as_matrix()]returnvae_model.model.get_layer('encoder').predict(encoder_inputs)unique_origins['embedding']=get_vae_embeddings(vae_model)plot_embeddings_on_world_map(unique_origins,output_path='../figures/vae_model_map.html') In [39]:# visit the URL for a full-screen view 👇VAE_EMBED_VIZ_S3_PATH='https://willwolf-public.s3.amazonaws.com/transfer-learning-flight-delays/vae_model_map.html'IFrame(VAE_EMBED_VIZ_S3_PATH,width=1000,height=800) Out[39]:FINALLY, TRANSFER THE LEARNING ¶ Retrain both models with 20 latent dimensions, then join the embedding back to our original dataset. 
In [30]:# dot product embeddingEMBEDDING_SIZE=20dp_model=DotProductEmbeddingModel(embedding_size=EMBEDDING_SIZE,λ=.0001)dp_model.compile(optimizer=Adam(lr=LEARNING_RATE),loss='binary_crossentropy')dp_model_fit=dp_model.fit(x=[X_r['origin_id'],X_r['dest_id'],X_r[['origin_latitude','origin_longitude']].as_matrix(),X_r[['dest_latitude','dest_longitude']].as_matrix(),],y=y_r,batch_size=256,epochs=5,verbose=0,callbacks=[TQDMNotebookCallback(leave_outer=False)]) In [31]:# variational autoencoder embeddingvae_model=VariationalAutoEncoderEmbeddingModel(embedding_size=EMBEDDING_SIZE,dense_layer_size=30,λ=.003)vae_model.compile(optimizer=Adam(lr=LEARNING_RATE),loss=[vae_model.variational_layer.loss,'mean_squared_logarithmic_error'],loss_weights=[1.,.2])X_r_origin=one_hot_airports[X_r['origin_id']]X_r_dest=one_hot_airports[X_r['dest_id']]vae_model_fit=vae_model.fit(x=[X_r_origin,X_r[['origin_latitude','origin_longitude']].as_matrix()],y=[X_r_dest,X_r[['dest_latitude','dest_longitude']].as_matrix()],batch_size=1024,epochs=5,verbose=0,callbacks=[TQDMNotebookCallback(leave_outer=False)]) EXTRACT EMBEDDINGS, CONSTRUCT JOINT DATASET ¶ In [32]:# get dot product, variational autoencoder embeddingsdp_embeddings=get_dp_embeddings(dp_model)vae_embeddings=get_vae_embeddings(vae_model)assertdp_embeddings.shape==vae_embeddings.shape,'Embedding matrices are of unequal size'# create names for embedding columnsn_embedding_dims=dp_embeddings.shape[1]dp_embedding_cols=['dp_dim_{}'.format(d)fordinrange(n_embedding_dims)]vae_embedding_cols=['vae_dim_{}'.format(d)fordinrange(n_embedding_dims)]embedding_cols=dp_embedding_cols+vae_embedding_cols# create an embeddings DataFrameembeddings_df=pd.DataFrame(data=np.concatenate([dp_embeddings,vae_embeddings],axis=1),columns=embedding_cols,index=unique_origins['origin']) In [33]:# construct joint datasetdelflights,X,yflights=feather.read_dataframe(FLIGHTS_PATH)X=flights[['DayOfWeek','DayofMonth','Month','ScheduledDepTimestamp','Origin','Dest','UniqueCarrier']].copy()y=flights['total_delay'].copy()X=X\ .join(embeddings_df,on='Origin',sort=False)\ .join(embeddings_df,on='Dest',rsuffix='_Dest',sort=False)\ .drop(['Origin','Dest'],axis=1)\ .fillna(0)# one-hotone_hot_matrices=[]embedding_cols=[colforcolinX.columnsif'_dim_'incol]column_filter=lambdacol:col!='ScheduledDepTimestamp'andcolnotinembedding_colsforcolinfilter(column_filter,X.columns):one_hot_matrices.append(pd.get_dummies(X[col]))one_hot_matrix=np.concatenate(one_hot_matrices,axis=1)X=np.concatenate([X[embedding_cols+['ScheduledDepTimestamp']],one_hot_matrix],axis=1)# normalizeX=StandardScaler().fit_transform(X)y=np.log(y+1).values In [34]:X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=TEST_SIZE,random_state=42)X_val,X_test,y_val,y_test=train_test_split(X_test,y_test,test_size=int(TEST_SIZE/2),random_state=42)print('Dataset sizes:')print(' Train: {}'.format(X_train.shape))print(' Validation: {}'.format(X_val.shape))print(' Test: {}'.format(X_test.shape)) Dataset sizes: Train: (30000, 151) Validation: (10000, 151) Test: (10000, 151) TRAIN ORIGINAL MODELS ¶ In [35]:simple_reg=SimpleRegression(input_dim=X.shape[1],λ=.05)simple_reg.compile(optimizer=Adam(lr=.0005),loss='mean_squared_error')simple_reg_fit=fit_flight_model(simple_reg,X_train,y_train,X_val,y_val,epochs=5,batch_size=16)plot_model_fit(simple_reg_fit) In 
[36]:deeper_reg=DeeperRegression(input_dim=X.shape[1],λ=.03,dropout_p=.2)deeper_reg.compile(optimizer=Adam(lr=.0001),loss='mean_squared_error')deeper_reg_fit=fit_flight_model(deeper_reg,X_train,y_train,X_val,y_val,epochs=5,batch_size=16)plot_model_fit(deeper_reg_fit) In [37]:y_pred_simple=simple_reg.model.predict(X_test).ravel()y_pred_deeper=deeper_reg.model.predict(X_test).ravel()mse_simple=mean_squared_error_scikit(y_test,y_pred_simple)mse_deeper=mean_squared_error_scikit(y_test,y_pred_deeper)print('Mean squared error, simple regression: {}'.format(mse_simple))print('Mean squared error, deeper regression: {}'.format(mse_deeper)) Mean squared error, simple regression: 2.3176028493805263 Mean squared error, deeper regression: 2.291221474968889 SUMMARY ¶ In fitting these models to both the original and ""augmented"" datasets, I spent time tuning their parameters — regularization strengths, amount of dropout, number of epochs, learning rates, etc. Additionally, the respective datasets are of different dimensionality. For these reasons, comparison between the two sets of models is clearly not ""apples to apples."" Notwithstanding, the airport embeddings do seem to provide a nice lift over our original one-hot encodings. Of course, their use is not limited to predicting flight delays: they can be used in any task concerned with airports. Additionally, these embeddings give insight into the nature of the airports themselves: those nearby in vector space can be considered as ""similar"" by some latent metric. To figure out what these metrics mean, though - it's back to the map. ADDITIONAL RESOURCES ¶ * Towards Anything2Vec * Deep Learning for Calcium Imaging * DeepWalk: Online Learning of Social Representations * Variational Autoencoder: Intuition and Implementation * Introducing Variational Autoencoders (in Prose and Code) * Variational auto-encoder for ""Frey faces"" using keras * Transfer Learning - Machine Learning's Next Frontier * Tutorial - What is a variational autoencoder? * A Beginner's Guide to Variational Methods: Mean-Field Approximation * Variational Autoencoder: Intuition and Implementation * CrossValidated - What is the manifold assumption in semi-supervised learning? * David Blei - Variational Inference * Edward - Variational Inference * On Discriminative vs. Generative classifiers: A comparison of logistic regression and naive Bayes CODE ¶ The repository for this project can be found here . FOOTNOTES ¶ 1: A Beginner's Guide to Variational Methods: Mean-Field Approximation","In this work, we explore improving a vanilla regression model with knowledge learned elsewhere. ",Transfer Learning for Flight Delay Prediction via Variational Autoencoders,Live,57 152,"HOLDEN KARAU - BIGDATASV 2016 - #BIGDATASV - THECUBE SiliconANGLE
Published on Mar 31, 2016. 01. Holden Karau, IBM, Visits #theCUBE! (00:21) 02. Give Us An Update On Spark. (00:43) 03. Do The Hardcore Spark Developers Have To Main Stream It. (01:48) 04. There's A Lot Of Integration What Are Your Thoughts On That. (03:22) 05. Is Spark A Comparable Investment To Lynx. (04:32) 06. Give Me An Example Of The Magnitude Of Spark. (06:11) 07. Can You Give Us Examples Of Products That Are Moving To Spark. (07:24) 08. Who Is Policing The Algorithms. (08:26) 09. Where Are We In Machine Learning Put On The Process Of The Design And RunTime. (11:03) 10. Do We See Big Packet Apps Emerging For This Class Of Apps. (15:32) 11. What Is Your Take On The Status Of Machine Learning. (17:32) 12. Do You Have Another Book On The Horizon. (19:24) Track List created with http://www.vinjavideo.com --- --- Machine learning on machine learning software: It's closer than you think | #BigDataSV, by Amber Johnson | Mar 31, 2016. As the tech world pivots on game-changing applications, data scientists rise to the occasion. Such is the case with Holden Karau, principal software engineer of Big Data at IBM and coauthor of Learning Spark. When asked about the current renovations within Spark, Karau said she sees this time as an “opportunity to get rid of dead weight” by streamlining certain processes. For example, she cited getting functional and relational queries to talk to each other within Spark. Two areas of expansion include sequencing and machine learning. Karau noted another “massive expansion” in getting other applications to run on top of Spark during an interview with John Furrier (@furrier) and George Gilbert (@ggilbert41), cohosts of theCUBE from the SiliconANGLE Media team, during the BigDataSV 2016 event in San Jose, California, where theCUBE is celebrating #BigDataWeek, including news and events from the #StrataHadoop conference. The three self-described tech geeks discussed the advances within the Spark community since the bandwagon effect kicked in. Karau predicted that machine learning on machine learning software will arrive sooner than Gilbert's conservative five-year estimate. While she didn't give a specific time frame, Karau stated emphatically that it is “closer than five years.” How data science is changing software dynamics: Karau conferred with Furrier and Gilbert about several aspects of data science and how it is changing software dynamics. One side project in particular stood out. Karau is working on a Spark validator that will help with “policing quality” in regards to algorithms within pipeline models. Pipeline models present challenges regarding working at large scale while still being able to work with the Big Data interactively. When asked about getting data science to work on data science, Karau said the tech was “there-ish.” In addition, Karau is working with her coauthor, Rachel Warren, on a new book called High Performance Spark.
Karau spoke eloquently and candidly about sources of frustration in working with Spark pipeline issues, saying, “How do I save this damn thing?” However, when it comes to Spark, Karau literally wrote the book. @theCUBE #BigDataSV #StrataHadoop * CATEGORY * Science & Technology * LICENSE * Creative Commons Attribution license (reuse allowed)","01. Holden Karau, IBM, Visits #theCUBE!. (00:21) 02. Give Us An Update On Spark. (00:43) 03. Do The Hardcore Spark Developers Have To Main Stream It. (01:48)...",Advancements in the Spark Community,Live,58 155,"PUBLISHED IN AUTONOMOUS AGENTS — #AI Preetham V V #AI & #MachineLearning enthusiast. Author: Java Web Services / Internet Security & Firewalls.
VP, Brand Sciences & Products @inMobi #UltraRunner -------------------------------------------------------------------------------- HOW TO TAME THE VALLEY — HESSIAN-FREE HACKS FOR OPTIMIZING LARGE #NEURALNETWORKS Let's say you have the gift of flight (or you are riding a chopper). You are also a spy (like in the James Bond movies). You are given the topography of a long, narrow valley as shown in the image, and you are given a rendezvous point to meet a potential aide who has intelligence that is helpful for your objective. The only information you have about the rendezvous point is as follows: “Meet me at the lowest co-ordinate of ‘this long valley’ in 4 hours.” How do you go about finding the lowest co-ordinate? More to the point, how do you intend to find it within the stipulated time? Well, for complex Neural Networks with very large numbers of parameters, the error surface of the Neural Network looks very much like this long, narrow valley. Finding a “minima” in the valley can be quite tricky when you have such pathological curvature in your topography. Note: there are many posts written on second-order optimization hacks for Neural Networks. The reason I decided to write about it again is that most of them jump straight into complex Math without much explanation. Instead, I have tried to explain the Math briefly where possible and mostly point to detailed sources to learn from if you are not trained in the particular field of Math. This post shall be a bit longish due to that. In past posts, we used Gradient Descent algorithms while back-propagating, which helped us minimize the errors. You can find the techniques in the post titled “Backpropagation — How Neural Networks Learn Complex Behaviors”. LIMITATIONS OF GRADIENT DESCENT There is nothing fundamentally wrong with a Gradient Descent algorithm [or Stochastic Gradient Descent (SGD) to be precise]. In fact, we have shown that it is quite efficient for some of the Feed Forward examples we have used in the past. The problem with SGD arises when we have “Deep” Neural Networks which have more than one hidden layer, especially when the network is fairly large. Here are some illustrations of a non-monotonic error surface of a Deep Neural Network to get an idea. [Figures: Error Surface — 2, Error Surface — 2] Note that there are many minima and maxima in the illustrations. Let us quickly look at the weight update process in SGD. [Figure: SGD weight updates; a minimal sketch of this update appears below.] The problem with using SGD on such surfaces is as follows: * Since SGD is a first-order optimization method, it assumes that the error surface always looks like a plane (in the direction of descent, that is) and does not account for curvature. * When there is quadratic curvature, we apply some tricks to ensure that SGD does not just bounce off the surface, as shown in the weight update equation. * We control the momentum value using some pre-determined alpha and control the velocity by applying a learning rate epsilon. * The alpha and the epsilon buffer the speed and direction of SGD and slow down the optimization until we converge. We can only tune these hyper-parameters to get a good balance of speed versus effectiveness of SGD. But they still slow us down. * In large networks with pathological curvatures as shown in the illustration, tuning these hyper-parameters is quite challenging. * The error in SGD can suddenly start rising when you move in the direction of the gradient while traversing a long narrow valley. In fact, SGD can almost grind to a halt before it can make any progress at all.
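Since the “SGD weight updates” equation image did not survive, here is a minimal sketch of the momentum update the bullets above describe, using the alpha (momentum) and epsilon (learning rate) names from the text; compute_gradient is a hypothetical placeholder for the back-propagated gradient, not a function from the original post.

def sgd_momentum_step(w, velocity, compute_gradient, alpha=0.9, epsilon=0.01):
    # First-order information only: the local gradient of the error surface at w.
    grad = compute_gradient(w)
    # alpha dampens the previous direction; epsilon scales the new gradient step.
    # (Initialize velocity to zeros of the same shape as w before the first step.)
    velocity = alpha * velocity - epsilon * grad
    # Move the weights along the buffered descent direction.
    return w + velocity, velocity

Tuning alpha and epsilon is exactly the balancing act described above: too aggressive and the update bounces off the valley walls, too conservative and progress grinds to a halt.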
We need a better method to work with large or Deep Neural Networks. SECOND ORDER OPTIMIZATION TO THE RESCUE SGD is a first-order optimization method. First-order methods are methods that assume linear local curves; that is, we assume that we can apply linear approximations to solve equations. Some examples of first-order methods are as follows: * Gradient Descent * Sub-Gradient * Conjugate Gradient * Random co-ordinate descent There are also second-order methods, which consider the convexity or curvature of the equation and perform quadratic approximations. A quadratic approximation is an extension of a linear approximation, but it provides an additional variable to deal with, which helps create a quadratic surface around a point on the error surface. The key difference between first-order and second-order approximations is that, while the linear approximation provides a “plane” that is tangential to a point on the error surface, the second-order approximation provides a quadratic surface that hugs the curvature of the error surface. If you are new to quadratic approximations, I encourage you to check this Khan Academy lecture on Quadratic approximations. The advantage of a second-order method is that it does not ignore the curvature of the error surface. Because the curvature is being considered, second-order methods are considered to have better step-wise performance: * The full step of a second-order method points directly to the minima of a curvature (unlike first-order methods, which require multiple steps with a gradient calculation in each step). * Since a second-order method points to the minima of a quadratic curvature in one step, the only thing you have to worry about is how well the curve actually hugs the error surface. This is a good enough heuristic to deal with. * Working with the hyper-parameters, given this heuristic, becomes very efficient. The following are some second-order methods: * Newton's method * Quasi-Newton, Gauss-Newton * BFGS, (L)BFGS Let's take a look at Newton's method, which is a base method and a bit more intuitive compared to the others. YO! NEWTON, WHAT'S YOUR METHOD? Newton's Method, also called the Newton-Raphson Method, is an iterative approximation technique for finding the roots of a real-valued function. It is one of the base methods used in second-order convex optimization problems to approximate functions. Let's first look at Newton's method using the first derivative of a function. Let's say we have a function f(x) = 0, and we have some initial solution x_0 which we believe is sub-optimal. Then, Newton's method suggests we do the following: 1. Find the equation of the tangent line at x_0. 2. Find the point at which the tangent line cuts the x-axis and call this new point x_1. 3. Find the projection of x_1 onto the function f(x) = 0, which is also at x_1. 4. Now iterate again from step 1, replacing x_0 with x_1. Really, that simple. The caveat is that the method does not tell you when to stop, so we add a 5th step as follows: 5. If the change in x_n (the current value of x) is equal to or smaller than a threshold, then we stop. Here is the image that depicts the above: [Figure: Finding the optimal value of x using Newton's Method.] Here is an animation that shows the same: [animation credit] First-degree-polynomial, one dimension: here is the math for a function which is a first-degree polynomial in one dimension.
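The equation image that originally followed did not survive, so here is the standard Newton-Raphson update that the five steps above describe, as a small sketch; the example function at the bottom is an assumption chosen purely for illustration.

def newton_1d(f, f_prime, x0, tol=1e-8, max_iter=50):
    # Newton-Raphson in one dimension: x_{n+1} = x_n - f(x_n) / f'(x_n).
    x = x0
    for _ in range(max_iter):
        x_new = x - f(x) / f_prime(x)   # where the tangent line at x cuts the x-axis
        if abs(x_new - x) <= tol:       # step 5: stop once the update falls below a threshold
            return x_new
        x = x_new
    return x

# Illustrative use: the positive root of f(x) = x**2 - 2, i.e. sqrt(2).
# newton_1d(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)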
Second-degree-polynomial, one dimension: now we can work on the Newton approximation for a second-degree polynomial (second-order optimization) in one dimension, before we get to multiple dimensions. A second-degree polynomial is quadratic in nature and needs a second-order derivative to work with. To work with the second derivative of a function, let's use the Taylor approximation: f(x + delta_x) ≈ f(x) + f'(x) · delta_x + (1/2) · f''(x) · delta_x^2, and minimizing this quadratic in delta_x gives the Newton step delta_x = −f'(x) / f''(x). Second-degree-polynomial, multiple dimensions: suppose we are working on a second-degree polynomial with multiple dimensions; then we use the same Newton approach as above, but replace the first derivatives with a gradient and the second derivatives with a Hessian, giving the update x_{n+1} = x_n − [Hessian of f(x_n)]^(-1) · [gradient of f(x_n)]. A Hessian Matrix is a square matrix of second-order partial derivatives of a scalar function, which describes the local curvature of a multi-variable function. Specifically, in the case of a Neural Network, the Hessian is a square matrix with the number of rows and columns equal to the total number of parameters in the Neural Network. The Hessian for a Neural Network looks as follows: [Figure: Hessian Matrix of a Neural Network] WHY IS A HESSIAN-BASED APPROACH THEORETICALLY BETTER THAN SGD? Second-order optimization using Newton's method of iteratively finding the optimal x is a clever hack for optimizing the error surface because, unlike SGD, where you fit a plane at the point x_0 and then determine the step-wise jump, in second-order optimization we fit a tightly hugging quadratic curve at x_0 and directly find the minima of that curvature. This is supremely efficient and fast. But!!! Empirically though, can you now imagine computing a Hessian for a network with millions of parameters? Of course it gets very inefficient, as the amount of storage and computation required to calculate the Hessian is of quadratic order as well. So though in theory this is awesome, in practice it sucks. We need a Hack for the Hack! And the answer seems to lie in Conjugate Gradients. CONJUGATE GRADIENTS Actually, there are several quadratic approximation methods for a convex function. But the Conjugate Gradient Method works quite well for symmetric matrices which are positive-definite. In fact, Conjugate Gradients are meant to work with very large, sparse systems. Note that a Hessian is symmetric around the diagonal, the parameters of a Neural Network are typically sparse, and the Hessian of a Neural Network is positive-definite (meaning it only has positive Eigenvalues). Boy, are we in luck? If you need a thorough introduction to Conjugate Gradient Methods, go through the paper titled " An Introduction to the Conjugate Gradient Method Without the Agonizing Pain " by Jonathan Richard Shewchuk. I find it quite thorough and useful. I would suggest that you study the paper in your free time to get an in-depth understanding of Conjugate Gradients. The easiest way to explain the Conjugate Gradient (CG) is as follows: * CG descent is applicable to any quadratic form. * CG uses a step-size 'alpha' value similar to SGD, but instead of a fixed alpha, we find the alpha through a line search algorithm. * CG also needs a 'Beta', a scalar value that helps find the next direction, which is ""conjugate"" to the previous direction. You can check most of the hairy math behind the CG equations in the paper cited above.
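As a companion to the bullet points above, here is a minimal numpy sketch of conjugate gradient for solving Ax = b when A is symmetric and positive-definite. It follows the standard textbook iteration rather than any code from the original post; in the Hessian-free setting, A plays the role of the Hessian, and the product A @ p can be replaced by the Hessian-vector approximation discussed below, so the full matrix never has to be formed.

import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    # Solve A x = b for a symmetric positive-definite A.
    x = np.zeros_like(b, dtype=float) if x0 is None else x0
    r = b - A @ x                # residual r_k
    p = r.copy()                 # first search direction p_k
    max_iter = max_iter or len(b)
    for _ in range(max_iter):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)        # exact line search for a quadratic form
        x = x + alpha * p                 # x_{k+1} = x_k + alpha_k * p_k
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)  # scalar that keeps the next direction conjugate
        p = r_new + beta * p
        r = r_new
    return x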
I shall jump directly to the conjugate gradient algorithm itself. For solving an equation Ax = b, we can use the standard iteration (the original post embedded the listing from Wikipedia as an image; the sketch above captures the same steps): * Here r_k is the residual, * p_k is the conjugate direction vector, and * x_{k+1} is iteratively updated from the previous value x_k plus the product of the step-size alpha_k and the conjugate vector p_k. Given that we know how to compute the Conjugate Gradient, let's look at the Hessian-free optimization technique. HESSIAN-FREE OPTIMIZATION ALGORITHM Now that we have understood the CG algorithm, let's look at the final clever hack that allows us to be free from the Hessian. CITATION: Hessian-free optimization is a technique adapted to Neural Networks by James Martens at the University of Toronto in a paper titled " Deep-Learning Via Hessian Free Optimization ". Let's start with a second-order Taylor expansion of a function: f(x + delta_x) ≈ f(x) + [gradient of f(x)] · delta_x + (1/2) · delta_x · [Hessian of f(x)] · delta_x. Here we need to find the best delta_x, then move to x + delta_x, and keep iterating until convergence. In other words, the steps involved in Hessian-free optimization are as follows: Algorithm: 1. Start with n = 0 and iterate. 2. Let x_n be some initial sub-optimal x_0, chosen randomly. 3. At the current x_n, given the Taylor expansion shown above, compute the gradient of f(x_n) and (implicitly) the Hessian of f(x_n). 4. Given the Taylor expansion, compute the step delta_x using the Conjugate Gradient algorithm and set x_{n+1} = x_n + delta_x. 5. Iterate steps 3–4 until the current x_n converges. The crucial insight: note that unlike in Newton's method, where the Hessian is needed to compute x_{n+1}, in the Hessian-free algorithm we do not need the full Hessian to compute x_{n+1}. Instead we use the Conjugate Gradient. Clever Hack: since the Hessian is only ever used multiplied by a vector, we just need an approximation of the Hessian-times-vector product, and we do NOT need the exact Hessian. Approximating the Hessian times a vector is far faster than computing the Hessian itself. Check the following reasoning. Take a look at the Hessian again: [Figure: Hessian Matrix of a Neural Network] Here, the i'th row contains partial derivatives of the form ∂²f / ∂x_i ∂x_j, where 'i' is the row index and 'j' is the column index. Hence the i'th entry of the product of the Hessian matrix and any vector v is the sum over j of ∂²f / ∂x_i ∂x_j multiplied by v_j. Using directional derivatives and finite differences, we can approximate this as Hv ≈ ([gradient of f(x + εv)] − [gradient of f(x)]) / ε for a small ε. In fact, a thorough explanation and technique for fast multiplication of a Hessian with a vector is available in the paper titled " Fast Exact Multiplication by the Hessian " by Barak A. Pearlmutter from Siemens Corporate Research. With this insight, we can completely skip the computation of the Hessian and just focus on approximating the Hessian-vector product, which tremendously reduces the computation and storage required. To understand the impact of the optimization technique: with this approach, instead of bouncing off the sides of the mountains like in SGD, you can actually move along the slope of the valley until you find a minima in the curvature. This is quite effective for very large Neural Networks or Deep Neural Networks with millions of parameters. Apparently, it's not easy to be a Spy… Machine Learning Artificial Intelligence Deep Learning Neural Networks Hessian Free Optimization
VP, Brand Sciences & Products @inMobi #UltraRunner FollowAUTONOMOUS AGENTS — #AI Notes of Artificial Intelligence and Machine Learning. × Don’t miss Preetham V V’s next story Blocked Unblock Follow Following Preetham V V",Let’s say you have the gift of flight (or you are riding a chopper). You are also a Spy (like in James Bond movies). You are given the…,How to tame the valley — Hessian-free hacks for optimizing large #NeuralNetworks – Autonomous Agents — #AI,Live,59 157,"RStudio Blog * Home * Subscribe to feed READR 1.0.0 August 5, 2016 in Packages readr 1.0.0 is now available on CRAN. readr makes it easy to read many types of rectangular data, including csv, tsv and fixed width files. Compared to base equivalents like read.csv() , readr is much faster and gives more convenient output: it never converts strings to factors, can parse date/times, and it doesn’t munge the column names. Install the latest version with: install.packages(""readr"") Releasing a version 1.0.0 was a deliberate choice to reflect the maturity and stability and readr, thanks largely to work by Jim Hester. readr is by no means perfect, but I don’t expect any major changes to the API in the future. In this version we: * Use a better strategy for guessing column types. * Improved the default date and time parsers. * Provided a full set of lower-level file and line readers and writers. * Fixed many bugs. COLUMN GUESSING The process by which readr guesses the types of columns has received a substantial overhaul to make it easier to fix problems when the initial guesses aren’t correct, and to make it easier to generate reproducible code. Now column specifications are printing by default when you read from a file: mtcars2 <- read_csv(readr_example(""mtcars.csv"")) #> Parsed with column specification: #> cols( #> mpg = col_double(), #> cyl = col_integer(), #> disp = col_double(), #> hp = col_integer(), #> drat = col_double(), #> wt = col_double(), #> qsec = col_double(), #> vs = col_integer(), #> am = col_integer(), #> gear = col_integer(), #> carb = col_integer() #> ) The thought is that once you’ve figured out the correct column types for a file, you should make the parsing strict. You can do this either by copying and pasting the printed column specification or by saving the spec to disk: # Once you've figured out the correct types mtcars_spec <- write_rds(spec(mtcars2), ""mtcars2-spec.rds"") # Every subsequent load mtcars2 <- read_csv( readr_example(""mtcars.csv""), col_types = read_rds(""mtcars2-spec.rds"") ) # In production, you might want to throw an error if there # are any parsing problems. stop_for_problems(mtcars2) You can now also adjust the number of rows that readr uses to guess the column types with guess_max : challenge <- read_csv(readr_example(""challenge.csv"")) #> Parsed with column specification: #> cols( #> x = col_integer(), #> y = col_character() #> ) #> Warning: 1000 parsing failures. #> row col expected actual #> 1001 x no trailing characters .23837975086644292 #> 1002 x no trailing characters .41167997173033655 #> 1003 x no trailing characters .7460716762579978 #> 1004 x no trailing characters .723450553836301 #> 1005 x no trailing characters .614524137461558 #> .... ... ...................... .................. #> See problems(...) for more details. 
challenge <- read_csv(readr_example(""challenge.csv""), guess_max = 1500) #> Parsed with column specification: #> cols( #> x = col_double(), #> y = col_date(format = """") #> ) (If you want to suppress the printed specification, just provide the dummy spec col_types = cols() ) You can now access the guessing algorithm from R: guess_parser() will tell you which parser readr will select. guess_parser(""1,234"") #> [1] ""number"" # Were previously guessed as numbers guess_parser(c(""."", ""-"")) #> [1] ""character"" guess_parser(c(""10W"", ""20N"")) #> [1] ""character"" # Now uses the default time format guess_parser(""10:30"") #> [1] ""time"" DATE-TIME PARSING IMPROVEMENTS: The date time parsers recognise three new format strings: * %I for 12 hour time format:library(hms) parse_time(""1 pm"", ""%I %p"") #> 13:00:00 Note that parse_time() returns hms from the hms package, rather than a custom time class * %AD and %AT are “automatic” date and time parsers. They are both slightly less flexible than previous defaults. The automatic date parser requires a four digit year, and only accepts - and / as separators. The flexible time parser now requires colons between hours and minutes and optional seconds.parse_date(""2010-01-01"", ""%AD"") #> [1] ""2010-01-01"" parse_time(""15:01"", ""%AT"") #> 15:01:00 If the format argument is omitted in parse_date() or parse_time() , the default date and time formats specified in the locale will be used. These now default to %AD and %AT respectively. You may want to override in your standard locale() if the conventions are different where you live. LOW-LEVEL READERS AND WRITERS readr now contains a full set of efficient lower-level readers: * read_file() reads a file into a length-1 character vector; read_file_raw() reads a file into a single raw vector. * read_lines() reads a file into a character vector with one entry per line; read_lines_raw() reads into a list of raw vectors with one entry per line. These are paired with write_lines() and write_file() to efficient write character and raw vectors back to disk. OTHER CHANGES * read_fwf() was overhauled to reliably read only a partial set of columns, to read files with ragged final columns (by setting the final position/width to NA ), and to skip comments (with the comment argument). * readr contains an experimental API for reading a file in chunks, e.g. read_csv_chunked() and read_lines_chunked() . These allow you to work with files that are bigger than memory. We haven’t yet finalised the API so please use with care, and send us your feedback. * There are many otherbug fixes and other minor improvements. You can see a complete list in the release notes . A big thanks goes to all the community members who contributed to this release: @ antoine-lizee , @ fpinter , @ ghaarsma , @ jennybc , @ jeroenooms , @ leeper , @ LluisRamon , @ noamross , and @ tvedebrink . 
SHARE THIS: * Reddit * More * * Email * Facebook * * Print * Twitter * * LIKE THIS: Like Loading...RELATED SEARCH LINKS * Contact Us * Development @ Github * RStudio Support * RStudio Website * R-bloggers CATEGORIES * Featured * News * Packages * R Markdown * RStudio IDE * Shiny * shinyapps.io * Training * Uncategorized ARCHIVES * August 2016 * July 2016 * June 2016 * May 2016 * April 2016 * March 2016 * February 2016 * January 2016 * December 2015 * October 2015 * September 2015 * August 2015 * July 2015 * June 2015 * May 2015 * April 2015 * March 2015 * February 2015 * January 2015 * December 2014 * November 2014 * October 2014 * September 2014 * August 2014 * July 2014 * June 2014 * May 2014 * April 2014 * March 2014 * February 2014 * January 2014 * December 2013 * November 2013 * October 2013 * September 2013 * June 2013 * April 2013 * February 2013 * January 2013 * December 2012 * November 2012 * October 2012 * September 2012 * August 2012 * June 2012 * May 2012 * January 2012 * October 2011 * June 2011 * April 2011 * February 2011 EMAIL SUBSCRIPTION Enter your email address to subscribe to this blog and receive notifications of new posts by email. Join 19,780 other followers RStudio is an affiliated project of the Foundation for Open Access Statistics LEAVE A COMMENT Comments feed for this article LEAVE A REPLY CANCEL REPLY Enter your comment here...Fill in your details below or click an icon to log in: * * * * * Email (required) (Address never made public) Name (required) WebsiteYou are commenting using your WordPress.com account. ( Log Out / Change ) You are commenting using your Twitter account. ( Log Out / Change ) You are commenting using your Facebook account. ( Log Out / Change ) You are commenting using your Google+ account. ( Log Out / Change ) CancelConnecting to %s Notify me of new comments via email. Notify me of new posts via email. « Don’t miss Hadley Wickham’s Master R Workshop September 12 and 13 in NYCBlog at WordPress.com. Subscribe to feed. FollowFOLLOW “RSTUDIO BLOG” Get every new post delivered to your Inbox. Join 19,780 other followers Build a website with WordPress.com Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email. %d bloggers like this:","readr 1.0.0 is now available on CRAN. readr makes it easy to read many types of rectangular data, including csv, tsv and fixed width files. Compared to base equivalents like read.csv(), readr is mu…",readr 1.0.0,Live,60 165,"METRICS MAVEN: WINDOW FUNCTIONS IN POSTGRESQL Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published May 10, 2016Compose's data scientist shares database features, tips, tricks, and code you can use to get the metrics you need from your data. In this first article, we'll look at how to use window functions in PostgreSQL. POSTGRESQL WINDOW FUNCTIONS If you use PostgreSQL, you're probably already familiar with many of the common aggregate functions , such as COUNT() , SUM() , MIN() , MAX() , and AVG() . But you may not be familiar with window functions since they're touted as an advanced feature. Window functions aren't nearly as esoteric as they may seem, however. As the name implies, window functions provide a ""window"" into your data, letting you perform aggregations against a set of data rows according to specified criteria that match the current row. 
While they are similar to standard aggregations, there are also additional functions that can only be used through window functions (such as the RANK() function we'll demonstrate below). In some situations window functions can minimize the complexity of your query or even speed up the performance. Make note: window functions always use the OVER() clause so if you see OVER() you're looking at a window function. Once you get used to how the OVER() clause is formatted, where it fits in your queries, and the kind of results you can get, you'll soon start to see lots of ways to apply it. Let's dive in! OVER( ) Depending on the purpose and complexity of the window function you want to run, you can use OVER() all by itself or with a handful of conditional clauses. Let's start by looking at using OVER() all by itself. If the aggregation you want to run is to be performed across all the rows returned by the query and you don't need to specify any other conditions, then you can use the OVER() clause by itself. Here's an example of a simple window function querying a table in our Compose PostgreSQL database containing the United States Census data on estimated population : SELECT name AS state_name, popestimate2015 AS state_population, SUM(popestimate2015) OVER() AS national_population FROM population WHERE state � Notice that we're using a window function to sum the state populations over all the result rows (that's the OVER() you see in our query... yep, just that one little addition to an otherwise standard query). Returned, we get result rows for each state and their populations with also the population sum for the nation - that's the aggregation we performed with our window function: state_name | state_population | national_population -------------------------------------------------------------- Alabama | 4858979 | 324893002 Alaska | 738432 | 324893002 Arizona | 6828065 | 324893002 Arkansas | 2978204 | 324893002 California | 39144818 | 324893002 Colorado | 5456574 | 324893002 Connecticut | 3590886 | 324893002 Delaware | 945934 | 324893002 District of Columbia | 672228 | 324893002 Florida | 20271272 | 324893002 . . . . Consider how this compares to standard aggregation functions. Without the window function, the simplest thing we could do is return the national population by itself, like this, by summing the state populations: SELECT SUM(popestimate2015) AS national_population FROM population WHERE state � The problem is, we don't get any of the state level information this way. To get the same results as our window function, we'd have to do a sub-select as a derived table: SELECT name AS state_name, popestimate2015 AS state_population, x.national_population FROM population, ( SELECT SUM(popestimate2015) AS national_population FROM population WHERE state > 0 -- only state-level rows ) x WHERE state � Looks ugly in comparison, doesn't it? Using window functions, our query is much less complex and easier to understand. CONDITION CLAUSES In the above example, we looked at a simple window function without any additional conditions, but in many cases, you'll want to apply some conditions in the form of additional clauses to your OVER() clause. One is PARTITION BY which acts as the grouping mechanism for aggregations. The other one is ORDER BY which orders the results in the window frame (the set of applicable rows). 
So, besides the format of the returned rows as we reviewed above, the other obvious difference with window functions is how the syntax works in your queries: use the OVER() clause with an aggregate function (like SUM() or AVG() ) and/or with a specialized window function (like RANK() or ROW_NUMBER() ) in your SELECT list to indicate you're creating a window and apply additional conditions as necessary to the OVER() clause, such as using PARTITION BY (instead of the GROUP BY you may be used to for aggregation). Let's look at some specific examples. PARTITION BY PARTITION BY allows us to group aggregations according to the values of the specified fields. In our census data for estimated population, each state is categorized according to the division and region it belongs to. Let's partition first by region: SELECT name AS state_name, popestimate2015 AS state_population, region, SUM(popestimate2015) OVER(PARTITION BY region) AS regional_population FROM population WHERE state � Now we can see the population sum by region but still get the state level data: state_name | state_population | region | regional_population ------------------------------------------------------------------------- Alabama | 4858979 | South | 121182847 Alaska | 738432 | West | 76044679 Arizona | 6828065 | West | 76044679 Arkansas | 2978204 | South | 121182847 California | 39144818 | West | 76044679 Colorado | 5456574 | West | 76044679 Connecticut | 3590886 | Northeast | 56283891 Delaware | 945934 | South | 121182847 District of Columbia | 672228 | South | 121182847 Florida | 20271272 | South | 121182847 . . . . Let's add division: SELECT name AS state_name, popestimate2015 AS state_population, region, division, SUM(popestimate2015) OVER(PARTITION BY division) AS divisional_population FROM population WHERE state � Now we're looking at state-level data, broken out by region and division, with a population summary at the division level: state_name | state_population | region | division | divisional_population ------------------------------------------------------------------------------------------- Alabama | 4858979 | South | East South Central | 18876703 Alaska | 738432 | West | Pacific | 52514181 Arizona | 6828065 | West | Mountain | 23530498 Arkansas | 2978204 | South | West South Central | 39029380 California | 39144818 | West | Pacific | 52514181 Colorado | 5456574 | West | Mountain | 23530498 Connecticut | 3590886 | Northeast | New England | 14727584 Delaware | 945934 | South | South Atlantic | 63276764 District of Columbia | 672228 | South | South Atlantic | 63276764 Florida | 20271272 | South | South Atlantic | 63276764 . . . . ORDER BY As you've probably noticed in the previous queries, we're using ORDER BY in the usual way to order the results by the state name, but we can also use ORDER BY in our OVER() clause to impact the window function calculation. For example, we'd want to use ORDER BY as a condition for the RANK() window function since ranking requires an order to be established. Let's rank the states according to highest population: SELECT name AS state_name, popestimate2015 AS state_population, RANK() OVER(ORDER BY popestimate2015 desc) AS state_rank FROM population WHERE state � In this case, we've added ORDER BY popestimate2015 desc as a condition of our OVER() clause in order to describe how the ranking should be performed. 
Because we still have our ORDER BY name clause for our result set, though, our results will continue to be in state name order, but we'll see the populations ranked accordingly with California as the number 1 ranked based on its population: state_name | state_population | state_rank ----------------------------------------------------- Alabama | 4858979 | 24 Alaska | 738432 | 49 Arizona | 6828065 | 14 Arkansas | 2978204 | 34 California | 39144818 | 1 Colorado | 5456574 | 22 Connecticut | 3590886 | 29 Delaware | 945934 | 46 District of Columbia | 672228 | 50 Florida | 20271272 | 3 . . . . Let's combine our PARTITION BY and our ORDER BY window function clauses now to see the ranking of the states by population within each region. For this, we'll change our result-level ORDER BY name clause at the end to order by region instead so that it'll be clear how our window function works: SELECT name AS state_name, popestimate2015 AS state_population, region, RANK() OVER(PARTITION BY region ORDER BY popestimate2015 desc) AS regional_state_rank FROM population WHERE state � Our results: state_name | state_population | region | regional_state_rank ---------------------------------------------------------------------- Illinois | 12859995 | Midwest | 1 Ohio | 11613423 | Midwest | 2 Michigan | 9922576 | Midwest | 3 Indiana | 6619680 | Midwest | 4 Missouri | 6083672 | Midwest | 5 Wisconsin | 5771337 | Midwest | 6 Minnesota | 5489594 | Midwest | 7 Iowa | 3123899 | Midwest | 8 Kansas | 2911641 | Midwest | 9 Nebraska | 1896190 | Midwest | 10 South Dakota | 858469 | Midwest | 11 North Dakota | 756927 | Midwest | 12 New York | 19795791 | Northeast | 1 Pennsylvania | 12802503 | Northeast | 2 New Jersey | 8958013 | Northeast | 3 . . . . Here we can see that Illinois is the number 1 ranking state by population in the Midwest region and New York is number 1 in the Northeast region. So, we combined some conditions here, but what if we need more than one window function? Read on... NAMED WINDOW FUNCTIONS In queries where you are using the same window function logic for more than one returned field or where you need to use more than one window function definition, you can name them to make your query more readable. Here's an example where we've defined two windows functions. One, named ""rw"", partitions by region and the other, named ""dw"", partitions by division. We're using each one twice - once to calculate the population sum and again to calculate the population average. Our windows functions are defined and named using the WINDOW clause which comes after the WHERE clause in our query: SELECT name AS state_name, popestimate2015 AS state_population, region, SUM(popestimate2015) OVER rw AS regional_population, AVG(popestimate2015) OVER rw AS avg_regional_state_population, division, SUM(popestimate2015) OVER dw AS divisional_population, AVG(popestimate2015) OVER dw AS avg_divisional_state_population FROM population WHERE state � Since we didn't do any manipulation on the averages values yet, the numbers look a little crazy, but that can be easily cleaned up using ROUND() and CAST() if need be. Our purpose here is to demonstrate how to use multiple window functions and the results you'll get. 
Check it out: state_name | state_population | region | regional_population | avg_regional_state_population | division | divisional_population | avg_divisional_state_population ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Alabama | 4858979 | South | 121182847 | 7128402.764705882353 | East South Central | 18876703 | 4719175.750000000000 Alaska | 738432 | West | 76044679 | 5849590.692307692308 | Pacific | 52514181 | 10502836.200000000000 Arizona | 6828065 | West | 76044679 | 5849590.692307692308 | Mountain | 23530498 | 2941312.250000000000 Arkansas | 2978204 | South | 121182847 | 7128402.764705882353 | West South Central | 39029380 | 9757345.000000000000 California | 39144818 | West | 76044679 | 5849590.692307692308 | Pacific | 52514181 | 10502836.200000000000 Colorado | 5456574 | West | 76044679 | 5849590.692307692308 | Mountain | 23530498 | 2941312.250000000000 Connecticut | 3590886 | Northeast | 56283891 | 6253765.666666666667 | New England | 14727584 | 2454597.333333333333 Delaware | 945934 | South | 121182847 | 7128402.764705882353 | South Atlantic | 63276764 | 7030751.555555555556 District of Columbia | 672228 | South | 121182847 | 7128402.764705882353 | South Atlantic | 63276764 | 7030751.555555555556 Florida | 20271272 | South | 121182847 | 7128402.764705882353 | South Atlantic | 63276764 | 7030751.555555555556 . . . . Now that's an informative report of population metrics... and window functions made it easy! WRAPPING UP This article has given you a glimpse of the power of PostgreSQL window functions. We touched on the benefits of using window functions, looked at how they are different (and similar) to standard aggregation functions, and learned how to use them with various conditional clauses, walking through examples along the way. Now that you can see how window functions work, start trying them out by replacing standard aggregations with window functions in your queries. Once you get the hang of them you'll be hooked. In our next article we'll look at window framing options in PostgreSQL to give you even more control over how your window functions behave. Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Lisa Smith - keepin' it simple. Love this article? Head over to Lisa Smith’s author page and keep reading. Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Database as a Service Support System Status Support Enhanced Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ Enterprise Deployments AWS DigitalOcean SoftLayer© 2016 Compose","If you use PostgreSQL, you're probably already familiar with many of the common aggregate functions, such as COUNT(), SUM(), MIN(), MAX(), and AVG(). But you may not be familiar with window functions since they're touted as an advanced feature. Window functions aren't nearly as esoteric as they may seem, however. 
Let's dive in!",Metrics Maven: Window Functions in PostgreSQL,Live,61 167,"Jump to navigation * Twitter * LinkedIn * Facebook * About * Contact * Content By Type * Blogs * Videos * All Videos * IBM Big Data In A Minute * Video Chats * Analytics Video Chats * Big Data Bytes * Big Data Developers Streaming Meetups * Cyber Beat Live * Podcasts * White Papers & Reports * Infographics & Animations * Presentations * Galleries * Subscribe×BLOGSDATA VISUALIZATION PLAYBOOK: THE IMPORTANCE OF EXCLUDING UNNECESSARY DETAILSPost Comment December 2, 2015 by Jennifer Shin Topics: Big Data Technology Tags: big data , data analytics , data science , data scientist , data visualization , visualizationsAs the big data revolution gathers momentum, data scientists are working withlarger data sets than ever before—a trend that shows no sign of abating. Butwith ever larger data sets comes the temptation to include ever moreinformation, representing the data in all its glorious detail. Who, after all,can resist the temptation to flex some intellectual muscles by mastering trulycomplex data?But as tempting as visualizing every last detail can be, doing so can erectbarriers to understanding. A visualization that includes unnecessary informationcan overwhelm readers, obscuring the message and leaving its audience confused.Let’s explore a real-world scenario, stepping through the thought process thatgoes into designing an effective data visualization.BREAK DOWN THE DATAA foundation set up to fund environmental projects published an overview of thefunding it provided during 2013. In its original form, shown in Figure 1, thereport included a pie chart showing the distribution of grants across a range ofenvironmental issues.Figure 1: The share of funding distributed to 17 environmental issues during2013.EVALUATE YOUR VISUALIZATION’S USABILITYA first glance at the pie chart reveals nothing wildly amiss. The chartrepresents the data simply and directly, breaking down the distribution offunding in its legend. But a c loser look reveals certain flaws that can impede understanding: * Excluded information The chart supplies an exact figure for only 7 of the 17 issues—specifically, only for issues that received at least 7 percent of the overall funding. * Cumbersome design The many slices in the pie chart distract from the larger facts, requiring readers to match the color of each slice with a color in the legend to identify the issue described. * Confusing presentation The choice of colors does little to differentiate issue areas—for example, a reader could easily mistake “Air Quality” for “Rivers and Lakes” or fail to differentiate “Populations” from “Wildlife Biodiversity.”Figure 2: The level of funding distributed to each environmental issue during2013, both in dollars and as a percentage of total funding.FIND THE FOREST IN THE TREESThe organization intended the visualization to provide an overview of issueareas funded during 2013. T o boost the overview’s effectiveness, the designer grouped environmental issuesinto five categories, as depicted in Figure 2: “Environmental Policy,” “Climateand Energy,” “Natural Resources,” “Preservation and Biodiversity” and“Sustainable Development.” The designer then redesigned the pie chart around thenew categories, grouping slices as shown in Figure 3.Figure 3: The share of funding distributed to each issue during 2013, withindividual issues delineated but grouped into colored categories.PROVIDE A QUANTITATIVE OVERVIEWBut the visualization still contained unnecessary information. 
The designerstreamlined the legend as shown in Figure 4a, emphasizing the categories anddispensing with a complete list of issues. To further emphasize the categories,the designer removed the lines demarcating individual issues and supplied thepercentage of funding distributed to each category, as shown in Figure 4b. By categorizing and unifying individual issues, the new visualization providedan effective overview featuring quantitative information.Figure 4a: The share of funding distributed to each category during 2013 , with individual issues delineated but grouped into colored categories .Figure 4b: The percentage of funding distributed to each category during 2013.CREATE NEW LEVELS OF INSIGHTAfter streamlining the pie chart, the designer introduced a new level ofanalysis, segmenting the data by global region and creating pie charts to showthe worldwide distribution of grants. To obviate the need for another legend,the designer superimposed the pie charts on a world map, as shown in Figure 5.Figure 5: The regional share of funding distributed to each category within eachglobal region during 2013.DESIGN FOR YOUR AUDIENCEBefore you create a data visualization, tailor your message to your audience.Don’t overwhelm your audience with data, but also take care not to render thedata useless through oversimplification. You’ll want to create one kind ofvisualization when presenting to experts in the field, for example, but anotherwhen giving a high-level overview to a general audience. To learn more, d iscover how the IBM advanced analytics portfolio can help you find patterns in and derive insights from your data through visualexploration.Follow @IBMBigDataRELATED CONTENTPODCASTHOW IS OPEN SOURCE TRANSFORMING STREAMING ANALYTICS?Open source is a disruptor that never quits. It seems to be penetrating andtransforming every aspect of established data, analytics and applicationecosystems. In this podcast, recorded at IBM InterConnect 2016, listen to DavidTaieb, a cloud data services developer advocate at IBM, share his... Listen to Podcast Podcast Becoming a cognitive business Podcast InsightOut: Leveraging metadata and governance Blog What is Spark? Blog Internet of Things data access and the fear of the unknown Blog Spark: The operating system for big data analytics Blog Graph databases catch electronic con artists in the act Blog InsightOut: Metadata and governance Blog New IBM DB2 release simplifies deployment and key management Podcast How is open source transforming graph analytics? Blog What is Hadoop? 
Blog The rise of NoSQL databases Blog Bridging Spark analytics to cloud data servicesView the discussion thread.IBM * Site Map * Privacy * Terms of Use * 2014 IBMFOLLOW IBM BIG DATA & ANALYTICS * Facebook * YouTube * Twitter * @IBMbigdata * LinkedIn * Google+ * SlideShare * Twitter * @IBManalytics * Explore By Topic * Use Cases * Industries * Analytics * Technology * For Developers * Big Data & Analytics Heroes * Explore By Content Type * Blogs * Videos * Analytics Video Chats * Big Data Bytes * Big Data Developers Streaming Meetups * Cyber Beat Live * Podcasts * White Papers & Reports * Infographics & Animations * Presentations * Galleries * Events * Around the Web * About The Big Data & Analytics Hub * Contact Us * RSS Feeds * Additional Big Data Resources * AnalyticsZone * Big Data University * Channel Big Data * developerWorks Big Data Community * IBM big data for the enterprise * IBM Data Magazine * Smarter Questions Blog * Events * Upcoming Events * Webcasts * Twitter Chats * Meetups * Around the Web * For Developers * Big Data & Analytics HeroesMore * Events * Upcoming Events * Webcasts * Twitter Chats * Meetups * Around the Web * For Developers * Big Data & Analytics HeroesSearchEXPLORE BY TOPIC:Use Cases All Acquire Grow & Retain Customers Create New Business Models Improve IT Economics Manage Risk Optimize Operations & Reduce Fraud Transform Financial Processes Industries All Automotive Banking Consumer Products Education Electronics Energy & Utilities Government Healthcare & Life Sciences Industrial Insurance Media & Entertainment Retail Telecommunications Travel & Transportation Wealth Management Analytics All Content Analytics Customer Analytics Entity Analytics Financial Performance Management Social Media Analytics Technology All Business Intelligence Cloud Database Data Warehouse Database Management Systems Data Governance Data Science Hadoop & Spark Internet of Things Predictive Analytics Streaming Analytics Blog For strategic planning, business must go beyond spreadsheets Presentation 5 questions to ask when analyzing customer behavior Blog Innovation, inspiration and practical intelligence: IBM Vision 2016 Interactive Cognitive business starts with analyticsMOREBlog For strategic planning, business must go beyond spreadsheets Presentation 5 questions to ask when analyzing customer behavior Blog Innovation, inspiration and practical intelligence: IBM Vision 2016 Interactive Cognitive business starts with analytics Interactive All data all the time: How mobile technology informs travelerhabits Podcast Finance in Focus: Innovative business ideas with Lisa Bodell Blog The secret to enhancing customer engagement Blog For strategic planning, business must go beyond spreadsheets Blog Innovation, inspiration and practical intelligence: IBM Vision 2016 Interactive Cognitive business starts with analytics Podcast Finance in Focus: Innovative business ideas with Lisa BodellMOREBlog For strategic planning, business must go beyond spreadsheets Blog Innovation, inspiration and practical intelligence: IBM Vision 2016 Interactive Cognitive business starts with analytics Podcast Finance in Focus: Innovative business ideas with Lisa Bodell Podcast How is open source transforming streaming analytics? 
Blog Big data in healthcare: The secret to calculating total cost of care Podcast InsightOut: Leveraging metadata and governance Blog 6 simple ways to help fight crime with analytics Presentation 5 questions to ask when analyzing customer behavior Blog Innovation, inspiration and practical intelligence: IBM Vision 2016 Interactive All data all the time: How mobile technology informs travelerhabitsMOREBlog 6 simple ways to help fight crime with analytics Presentation 5 questions to ask when analyzing customer behavior Blog Innovation, inspiration and practical intelligence: IBM Vision 2016 Interactive All data all the time: How mobile technology informs travelerhabits Podcast Finance in Focus: Innovative business ideas with Lisa Bodell Blog How to protect our PII and sensitive information from fraud Blog Big data in healthcare: The secret to calculating total cost of care Interactive Cognitive business starts with analytics Blog The secret to enhancing customer engagement Podcast How is open source transforming streaming analytics? Blog Big data in healthcare: The secret to calculating total cost of careMOREInteractive Cognitive business starts with analytics Blog The secret to enhancing customer engagement Podcast How is open source transforming streaming analytics? Blog Big data in healthcare: The secret to calculating total cost of care Podcast Becoming a cognitive business Podcast InsightOut: Leveraging metadata and governance Blog The LED lighting revolution * Home * Explore By Topic * Use Cases * All * Acquire, Grow & Retain Customers * Create New Business Models * Improve IT Economics * Manage Risk * Optimize Operations & Reduce Fraud * Transform Financial Processes * Industries * All * Banking * Consumer Products * Education * Energy & Utilities * Government * Healthcare & Life Sciences * Industrial * Insurance * Media & Entertainment * Retail * Telecommunications * Analytics * All * Content Analytics * Customer Analytics * Entity Analytics * Social Media Analytics * Technology * All * Business Intelligence * Cloud Database * Data Governance * Data Warehouse * Database Management Systems * Data Science * Hadoop & Spark * Internet of Things * Predictive Analytics * Streaming Analytics * Content By Type * Blogs * Videos * All Videos * IBM Big Data In A Minute * Video Chat * Analytics Video Chats * Big Data Bytes * Big Data Developers Streaming Meetups * Cyber Beat Live * Podcasts * White Papers & Reports * Infographics & Animations * Presentations * Galleries * Big Data & Analytics Heroes * For Developers * Events * Upcoming Events * Webcasts * Twitter Chat * Meetups * Around The Web * About Us * Contact Us * Search Site",Find out how including too much information can neutralize your data visualization—and your message with it.,Data visualization: The importance of excluding unnecessary details,Live,62 168,"PostgreSQL is a powerhouse of SQL driven database power and Compose's PostgreSQL is all that with the power of Compose's cloud deployments. But before you can harness that power you need to create users to access your database. In this article, we're going to show you the quick way to do that and then introduce you to one of PostgreSQL's powerful tools. Let's begin, right after you've created your first PostgreSQL database.When you create a PostgreSQL deployment, there's only one role that has been created for the database and that's the admin role. ""Wait a minute"" you may be thinking ""I want a user not a role"". 
PostgreSQL has pulled in all the concepts of users, groups and permissions and turned them into one concept, roles. Roles can represent one or many users, one role can grant another role membership to grant privileges, roles can own tables or other database objects. Roles are PostgreSQL's swiss army knife of access control.Now the admin role that is created is not the database superuser. Thats a more restricted account that isn't remotely accessible. The admin role is, basically, the first user Compose creates on the PostgreSQL database and it has permission to create new databases and create new roles. This user role is also there so essential maintenance can be performed by Compose's own automated processes. You can use the admin role as your sole login to the database, or you can create your own roles. Let's first look at the admin role.When you look at your overview page for the database you'll see that information in the Connection info panel under Credentials. Or rather you won't because by default it is obscured.Below the credentials are the connection string you use to connect applications to your database and the command line you can use, if you have psql installed, to create an interactive command line into the database. Notice all the sensitive information is hidden in these too by marking where you would substitute a username or password.When you click the Show link – you'll be prompted for your Compose account password before anything is shown – the credentials, connection string and command line will all be populated with the admin role's credentials. You can use these as is in your applications if you wish. Make a mental note of the admin password then click the Hide link which appeared where the Show link to obscure that information.You may, though, want to create more roles to be used for particular purposes. It's not necessary to create a role-per-user or role-per-application. Experience tells us that the more roles you create, the more complex your access control will be and the harder to manage. With that in mind we suggest that you create one or two roles at most with appropriate capabilities and use them. The quickest way to create one of those roles is to use the data browser. click on the Browser button in the sidebar.We've talked about the data browser before, but since that article was published, its gained the ability to let you add and remove roles. The first thing the browser shows you are the databases currently configured and by default on Compose, your first database is called compose. You can create more databases here by clicking on Create Database in the top right, but for now we're going to work with the default compose database so click on that in the database list.And now we're into the table level view of the browser, or rather not, because this database doesn't have any tables yet. We'll get back to that. Right now we want to create a role, and we can do this by clicking on the Roles option in the sidebar.But, yes, we just want a user and if you look, you can see we are viewing Users, roles which are configured as database users. And there in the list is our admin user. You can see the power that user has because listed there are the three roles it is a member of - login so admin can log in, createdb so it can create databases and createrole so it can create new users and roles. 
We can make a user of our own here by clicking on Add User in the top right.This is where we can add user roles; we just fill in the blanks in the command line, so if we're making a user called fred and giving them a password 'drowssap' we put those in the appropriate fields.If we press Add role now though, the created role would just be an unprivileged user who could not login. Click on the Login button to add that privilege to the user. If you want the user to be able the create databases and roles, click the appropriate buttons too but remember in the world of privileges, less privilege usually means more security. When you are done, press Add role and the new user will be created. Apart from dropping the user, thats all you can do with the database browser but its enough to create new users. The next stage will be to connect to the database using the new user we created.PostgreSQL's command line is called psql and you probably don't have it as it is usually bundled with the PostgreSQL database system. This is a common occurrence with database software and it means you'll have to download and install the database software locally to get the official tools – the important part is you shouldn't run the database itself. You can find where to download binaries of PostgreSQL on the project's website. For discussion purposes, we'll set up Mac OS X. If you look on the Mac OS X packages page you'll find a number of options. There's an EnterpriseDB graphical installer and the Postgres.app GUI installer; we'll skip them as they are very much about getting the database itself running. There's also packages in the Fink, MacPorts and Homebrew package managers. At Compose, we're big fans of Homebrew because it works so well. You'll need to install Homebrew first, follow the instructions on the home page. Once that's installed, just run brew install postgresql...Most of the text there relates to how to configure the database server to start up so we can ignore that. At the end of the process, what's important is the psql binary is installed in /usr/local/bin. Now you can get to connecting. Head back to the Overview for your Compose PostgreSQL database and look for the Command Line connection string. Use that entire line, substituting in your new username. You'll be prompted for your password and then connected:Congratulations, you've just plugged into one of the most powerful command line tools for any database. At its most basic, you can type in Postgres SQL commands and see them executed. Let's create a table:The command like doesn't consider a command complete until it is terminated with a correctly placed semi-colon. The thing to keep an eye on it the prompt, specifically the character after the database name and before the >. When it's = it is the start of a new command. When it's - it means that this is a continuation of the previous line and when it's ( it means that what's been typed so far has opened a parentheses, but not yet closed it so a semi-colon entered before closing the parentheses would be an error.Here we are creating the table rockets and we open the parentheses on the first line then hit return. Then we enter the columns, with commas as separators – we could type these all on one line but this is easier to type and read back. Finally we close the parentheses and end the statement with a semicolon. Psql then echos back the type of command that has been run (or displays and error) and returns to the = prompt. 
We can then insert some data into our new table:The last number after the INSERT reflect the number of rows inserted (usually). Of course, we can select to get that information back. Here we'll break up the command over a couple of lines, because we can:If you want the full list of commands available, enter \h to list all the SQL commands and follow the \h with the name of a command to get further details on it. It's useful help, but remember the PostgreSQL documentation is also a useful companion available online or offline as A4 or US sized PDFs.But it's not only SQL commands that can be entered into psql. It has its own rich command set too. To list those commands type \? - all the psql commands are preceded with a backslash. One of the essential commands is \d which will tell you about the objects and their relations in your database. Without any parameters it'll tell you about tables like so:Give it a parameter like the name of a table and it will give you information about the columns and indexes of that table:There's a huge number of psql commands you may want to put to use. The \e command will call up vi or the editor set in the environment variable ""PSQL_EDITOR"" and let you edit the last SQL command. \i will read and execute commands from a file. \s will display your command history and yes you can cursor up and down through that history. \w lets you write the query buffer (where the last command is saved) to disk. One favorite is \watch which will repeat the last query every two seconds – follow it with a number of seconds to adjust that. We could spend an entire article looking at the applications and uses of the psql command set. Suffice to say for now it is extensive and very useful.We've shown you how to create users quickly on Compose PostgreSQL and how to use those new users with PostgreSQL's powerful command line. If you think that's powerful, just wait till you see what you can do with the database itself!",PostgreSQL is a powerhouse of SQL-driven database power and Compose's PostgreSQL is all that with the power of Compose's cloud deployments. But before you can harness that power you need to create users to access your database.,Compose PostgreSQL: Making users and more,Live,63 169,"* About * Services * Portfolio * Teaching * Blog * Contact PREDICTING GENTRIFICATION USING LONGITUDINAL CENSUS DATA By Ken SteifAuthors: Ken Steif, Alan Mallach, Michael Fichman, Simon Kassel Figure 1: A mockup of a web-based, community-oriented gentrification forecasting applicationRecently, the Urban Institute called for the creation of “neighborhood-level early warning and response systems that can help city leaders and community advocates get ahead of (neighborhood) changes.” Open data and open-source analytics allows community stakeholders to mine data for actionable intelligence like never before. The objective of this research is to take a first step in exploring the feasibility of forecasting neighborhood change using longitudinal census data in 29 Legacy Cities (Figure 2). The first section provides some motivation for the analysis. Section 2 discusses the feature engineering and machine learning process. Section 3 provides results and the final section concludes with a discussion of community-oriented neighborhood change forecasting systems. Figure 2: Legacy cities used in this analysisWhy forecast gentrification? Neighborhoods change because people and capital are mobile and when new neighborhood demand emerges, incumbent residents rightfully worry about displacement. 
Acknowledging these economic and social realities, policy makers have a responsibility balance economic development and equity. To that end, analytics can help us understand how the wave of reinvestment moves across space and time and how to pinpoint neighborhoods where active interventions are needed today in order to avoid negative outcomes in the future. While the open data movement and open source software like Carto and R lower costs associated with community analytics, time series parcel-level data is expensive to collect, store and analyze. Census data is ubiquitous however, and many non-profits are well-versed in technologies like the Census’ American FactFinder and The Reinvestment Fund’s PolicyMap . Thus, it seems reasonable to develop forecasts using these data before building comparable models using the more expensive, high resolution space/time home sale data. The goal here is to use 1990 and 2000 Census data on home prices to predict home prices in 2010. If those models prove robust, we can use the model to forecast for 2020. Endogenous gentrification The key to our forecasting methodology is the conversion of Census tract data into useful ‘features’ or variables that help predict price. Our empirical approach is inspired by the theory of ‘endogenous gentrification’ – a theory of neighborhood change which suggests that low-priced neighborhoods adjacent to wealthy ones have the highest probability of gentrifying in the face of new housing demand. Typically, urban residents trade off proximity to amenities with their willingness to pay for housing. Because areas in close proximity to the highest quality amenities are the least affordable, the theory suggests that gentrifiers will choose to live in an adjacent neighborhood within a reasonable distance of an amenity center but with lower housing costs. As more residents move to the adjacent neighborhood, new amenities are generated and prices increase which means that at some point, the newest residents are going to settle in the next adjacent neighborhood and so on. This space/time process resembles a wave of investment moving across the landscape. Our forecasting approach attempts to capture this wave by developing a series of spatially endogenous home price features. The models attempt to trade off these micro-economic patterns with macro-economic trends that face many of the Legacy Cities in our sample. Principal among these is the Great Recession of the late 2000s. Of equal importance is the fact that gentrification affects only a small fraction of neighborhoods. As our previous research has demonstrated, neighborhood decline is still the predominant force in U.S. Legacy cities . Featuring Engineering Our dataset consists of 3,991 Census tracts in 29 Legacy Cities from 1990, 2000 and 2010. The data originates from the Neighborhood Change Database (NCDB) which standardizes previous Decennial Census surveys into 2010 geographical boundaries allowing for repeated measurements for comparable neighborhoods over time. While standardizing tract geographies over time is certainly convenient, it does not account for the ecological fallacy nor deal with the fact that tracts rarely comprise actual real estate submarkets. Figure 3 plots the distribution of Median Owner-Occupied Housing Value for 1990, 2000 & 2010 for the 29 cities in our sample. There is no clear global price trend in our sample. Some cities see price increases, some see decreases and others don’t change at all. 
Figure 3: Median Owner-Occupied Housing Value by cityBetween census variables and those of our own creation, our dataset consists of nearly 200 features or variables that we use to predict price. We develop standard census demographic features as well endogenous features that explain price as a function of nearby prices and other economic indicators like income. There are three main statistical approaches we take to develop these features. The simplest of our endogenous price features is the ‘spatial lag’, which for any given census tract is the simply the average price of tracts that surround it (Figure 4). Figure 5 shows the correlation between the spatial lag and price for the cities in our sample. Figure 4: The spatial lag Figure 5: Price as a function of spatial lagOur second endogenous price feature is one which measures proximity to high-cost areas. Here we create an indicator for the highest priced and highest income tracts for each city in each time period and calculate the average distance in feet from each tract to its n nearest 5th quintile neighbors in the previous time period. The motivation here is to capture emerging demand in the ‘next adjacent’ neighborhood over time (Figure 6). Figure 7 shows the correlation with price. Figure 6: Distance to highest value tract in the previous time period Figure 7: Price as a function of its distance to highest value tract in the previous time periodOur third endogenous price predictor attempts to capture the local spatial pattern of prices for a tract and its adjacent neighbors. As previously mentioned, to be robust, our algorithm must trade-off global trends with local neighborhood conditions. There are three local spatial patterns of home prices that we are interested in: clustering of high prices; clustering of low prices; and spatial randomness of prices. Local clustering of high and low prices suggests that housing market agents agree on equilibrium prices in a given neighborhood. Local randomness we argue, is indicative of a changing neighborhood – one that is out of equilibrium. A similar approach was used for a previous project, predicting vacant land prices in Philadelphia. In a changing neighborhood, buyers and sellers are unable to predict the value of future amenities. Our theory argues that when this uncertainty is capitalized into prices, the result is a heterogeneous pattern of prices across space. Capturing this spatial trend is crucial for forecasting neighborhood change. To do so we develop a continuous variant of the one-sided Local Moran’s I statistic. Assume that the dots in Figure 8 below represent home sale prices for houses or tracts. The homogeneous prices in the left panel are indicative of an area in equilibrium where all housing market agents agree on future expectations. Our Local Moran’s I feature of this area would indicate relative clustering. Figure 8: Equilibrium and disequilibrium marketsConversely, the panel on the right with more heterogeneous prices, is more indicative of a neighborhood in flux – one where housing market agents are capitalizing an uncertain future into prices. In this case, the Local Moran’s I feature would indicate a spatial pattern closer to randomness. We find this correlation in many of the cities in our sample as illustrated in Figure 9. Figure 9: Price as a function of the Local Moran’s I p-valueResults A great deal of time was spent on feature engineering and feature selection. 
We employ four primary machine learning algorithms, Ordinary Least Squares (OLS), Gradient Boosting Machines (GBM), Random Forests, and an ensembling approach that combines all three. You can find more information on these models in our paper which is linked below. Our models are deeply dependent on cross-validation , ensuring that goodness of fit is based on data that the model has not seen. Although we estimate hundreds of models, Table 1 presents (out of sample) goodness of fit metrics for our four best – each an example of one of the four predictive algorithms. The “MAPE” or mean absolute percentage error, is the absolute value of the average error (the difference between observed and predicted prices by tract) represented as a percentage which allows for a more consistent way to describe model error across cities. Table 1: Goodness of fit metrics for four modelsThe Standard Deviation of R-Squared measures over-prediction. Using cross-validation, each time the model is estimated with another set of randomly drawn observations, we can record goodness of fit. If the model is truly generalizable to the variation in our Legacy City sample, then we should expect consistent goodness of fit across each permutation. If the model is inconsistent across each permutation, it may be that the goodness of fit is driven solely by individual observations drawn at random. This latter outcome might indicate overfitting. Thus, this metric collects R^2 statistics for each random permutation and then uses standard deviation to assess whether the variation in goodness of fit across each permutation is small (ie. generalizability) or a large (ie. overfitting). Figure 10 shows the predicted prices as a function of observed prices for all tracts in the sample. If predictions were perfect, we would expect the below scatterplots to look like straight lines. The obvious deviation from would-be straight lines is much greater for the OLS and GBM models then for the random forest and stacked ensemble models. These models loose predictive power for higher priced tracts. Figure 10: Predicted prices as a function of observed prices for all tractsFigures 11-14 display observed vs. predicted prices by city for each of the predictive algorithms. Again, OLS and Random Forests loose predictive power for high priced tracts. However, when predictions are displayed in this way, the GBM and Ensemble predictions appear quite robust for most of the cities. Figure 11: OLS predicted prices as a function of observed prices for tracts by city Figure 12: GBM predicted prices as a function of observed prices for tracts by city Figure 13: Random Forest predicted prices as a function of observed prices by city Figure 14: Ensemble predicted prices as a function of observed prices by cityFinally, Figures 15 and 16 display the MAPE (error on a percentage basis) by City in bar chart and map form respectively. The highest error rates that we observe at the city-level is around 13.5% and the smallest is around 4%. One important trend to note is that we achieve ~8% errors for many of the larger, post-industrial cities. In addition, it does not seem as though there is an observable city-by-city pattern in error. That is, the model is not biased toward smaller cities or larger ones or those with booming economies. This is evidence that our final model is generalizable to a variety of urban contexts. 
Figure 15: MAPE by City Figure 16: MAPE by City in map formFinally, Figure 17 illustrates for Chicago, the 2010 predictions generated for each of the four algorithms along with the observed 2010 median owner-occupied home prices. Despite the stacked ensemble predictions having the lowest amount of error, it still appears to underfit for the highest valued tracts (Panel 4, Figure 17). This occurs for three reasons. First, the census data is artificially capped at $1 million dollars which creates artificial outlying “spikes” in the data, that, despite our best efforts, we were unable to model in the feature engineering process. Second, as previously mentioned, the predominant pattern over time is decline not gentrification. Thus, it is difficult for the model, at least in Chicago, to separate a very local phenomenon like gentrification, from a more global phenomenon like decline. Because all cities are modeled simultaneously, these predictions are also weighted not only by the Chicago trend, but by the trend throughout the sample. Finally, and this is probably the most important issue, our time serious has just two preceding time periods to use as predictors while neighborhood trajectories are clearly more fluid. Figure 17: Predicted prices for four algorithms and observed 2010 prices, ChicagoThe implications of this under-prediction in cities like Chicago is that our forecasts in these cities will also under-predict. While we choose to illustrate Chicago, many cities do not in fact under-predict. This is evident in their 2020 forecasts as seen in Figure 18 below. The next step is to rerun our models using 2000 and 2010 to forecast for 2020. Figure 18 shows the results of these forecasts in barplot form by city, alongside observed 1990, 2000 and 2010 prices. Many cities including Baltimore, Chicago, Cleveland, Detroit, Minneapolis, Newark, Philadelphia, show a marginal increase in price forecasts for 2020. Others, such as Baltimore, Boston, Jersey City and Washington D.C. do not, despite the fact that anecdotally, we might expect them to. Figure 19 shows tract-level predictions for three cities. With this under-prediction notwithstanding, we still are quite pleased with how much predictive power we could mine from these data. As we discuss below, there is a real upside to replicating this model on sales-level data. Figure 18: Time series trend with predictions by City Figure 19: Tract-level forecasts in 3 citiesNext steps It appears that endogenous spatial features combined with modern machine learning algorithms can help predict home prices in American Legacy Cities using longitudinal census data – with caveats as mentioned above. As previous work has shown however, this approach is really powerful when using parcel-level time series sales data. This insight motivates what we think are some important next steps with respect to the development of neighborhood change early warning systems. First, check out this phenomenal paper recently published in HUD’s Cityscape journal entitled “Forewarned: The Use of Neighborhood Early Warning Systems for Gentrification & Displacement” by Karen Chapple and Mariam Zuk. The authors raise two important points. The first, is that existing early warning systems are not doing a great job on the predictive analytics side. We think that many of these deficiencies could be addressed as more Planners become versed in machine learning techniques including how to build useful features like the endogeneous gentrification variables described above. 
The second critical point that Chapple and Zuk raise is that “Little is understood, however, about precisely how stakeholders are using the systems and what impact those systems have on policy.” A UX/UI engineer might restate this question by asking, “What are the use cases? Why would someone use such a system?” The point is that no matter how well the model performs, if insights cannot be converted into equity and real policy, then predictive accuracy is meaningless. Here are some suggestions about how the next generation of forecasting tools could look: Figure 20: An example of a gentrification early warning system using event-based forecasts. First, instead of modeling data for 29 cities at once, consider a model built for one city, using consecutive years of parcel-level data. Second, alongside a continuous outcome like price, consider modeling a series of development-related events like new construction permits, rehab permits and evictions. Event-based forecasting could, for instance, predict the probability of (re)development for each parcel citywide. These probabilities could help the government and non-profit sectors better allocate their limited resources. This helps us get a better sense of the most appropriate use cases, like, “Where should we build our next affordable housing development?” A tool like the one shown in Figure 20 would allow equity-driven organizations to strategically plan future development and redevelopment opportunities as well as better manage the existing stock of affordable housing in the neighborhood. It would also help state finance agencies better target tax credits, and aid planning and zoning boards in understanding the effect that zoning variances might have on future development patterns. If one were to combine these predictive price and event-based algorithms into one information system, fueled predominantly by preexisting city-level open data, the potential value-added for community organizations, government and grant-making institutions would be immense. Conclusion: This report experimented with using longitudinal census data to predict home prices in 29 American Legacy Cities. Our motivation was that if we could develop a robust model, the results could help community stakeholders better allocate their limited resources. Our training models use 1990 and 2000 data to predict for 2010, yielding an average prediction error of just 14% across all tracts. When this error is considered on a city-by-city basis, the median error is around 8%. It is important to note that our endogenous feature approach does not overfit the model. We think these results are admirable given the limitations of our time series; the fact that our unit of analysis, census tracts, rarely if ever conforms to true real estate submarkets; and the fact that neighborhood decline is still the predominant dynamic in these Legacy Cities. The greatest weakness of our model is that the limited time series is the likely driver of under-prediction in some cities, which affects our 2020 forecasts. Our major methodological contribution is the adoption of endogenous gentrification theory in the development of spatial features that are effective for predicting prices in a machine learning context without overfitting. We believe that this approach can and should be extended to parcel data, using both continuous outcomes like prices and event-based outcomes like development.
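To make the event-based extension concrete, here is a minimal sketch, under assumed inputs rather than anything from the report, of forecasting a per-parcel probability of (re)development with a gradient-boosted classifier. The parcel table, feature columns, and the permit_next_year label are hypothetical.

```python
# Minimal sketch of the event-based idea; not code from the report. It
# assumes a parcel-level DataFrame with engineered features and a binary
# label such as "new construction permit issued in the following year".
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

def redevelopment_probabilities(parcels, feature_cols, label_col='permit_next_year'):
    '''Fit a classifier and attach a per-parcel probability of (re)development.'''
    X, y = parcels[feature_cols], parcels[label_col]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = GradientBoostingClassifier().fit(X_train, y_train)
    print('held-out accuracy:', clf.score(X_test, y_test))
    scored = parcels.copy()
    scored['p_redevelopment'] = clf.predict_proba(X)[:, 1]
    # highest-probability parcels first, e.g. for targeting limited resources
    return scored.sort_values('p_redevelopment', ascending=False)
```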
These algorithms and technological innovations such as the information system described above, can play a pivotal role in how community stakeholders allocate their limited resources across space. Ken Steif, PhD is the founder of Urban Spatial. He is also the director of the Master of Urban Spatial Analytics program at the University of Pennsylvania. You can follow him on Twitter @KenSteif . The full report can be downloaded here . This work was generously supported by Alan Mallach and the Center for Community Progress . This is the second of two neighborhood change research reports – here is the first . Urban Spatial 508 S. Melville St. Philadelphia, PA 19143",Open data and open-source analytics allows community stakeholders to mine data for actionable intelligence like never before. The objective of this research is to take a first step in exploring the feasibility of forecasting neighborhood change using longitudinal census data in 29 Legacy Cities.,Predicting gentrification using longitudinal census data,Live,64 170,"Homepage IBM Watson Data Lab Follow Sign in / Sign up * Home * Cognitive Computing * Data Science * Web Dev * Mike Broberg Blocked Unblock Follow Following Editor for the IBM Watson Data Platform developer advocacy team. OK person. Mar 17 -------------------------------------------------------------------------------- INTERCONNECT WITH US PRACTICAL INFO FOR THE BIG IBM CONFERENCE MARCH 19–23 If you’re an IBM customer or business partner, you’ve probably heard of the company’s InterConnect conference . If you’re unfamiliar, IBM InterConnect is a huge conference at the Mandalay Bay in Las Vegas. It features the “what’s next” of tech innovation for cloud services, Internet of Things, and IBM Watson. This event might sound overwhelming because IBM is such a big company. As developer advocates, however, we work to make things simple. Our team will be attending with a focus on delivering talks and presenting example code that’s open source, useful, and approachable — whether you do business with IBM or not. MACHINE LEARNING WITH APACHE SPARK™ Our robot pal Marvin took an early flight to Vegas—via UPS. You can find him in the InterConnect DevZone, where he plans to beat you at a high-stakes game of Rock, Paper, Scissors. With a little help from Apache Spark , Marvin uses machine learning algorithms to find patterns in human gameplay, then exploits them as he chooses his moves. “Oh, human. Surely you must be cheating somehow.” —MarvinMarvin dishes out the sass. I expect he’ll be playing this card a lot as he kills time in Vegas waiting for human opponents to arrive: “Nothing personal, human. But my brain is connected to Apache Spark.” —MarvinOFFLINE-FIRST DEMO APP Voice of InterConnect is a web app that uses Hoodie for its backend, where it combines with several IBM services to measure attendee sentiment about the conference (and possibly about dinosaurs). Hoodie is a complete backend for your apps, exposed as a JavaScript API and accessible from the browser. For the Voice of InterConnect app, it’s also hooked up to the following IBM services: Cloudant, Watson Speech to Text, and Watson Natural Language Understanding. Architecture for the Voice of InterConnect sentiment app. Code on GitHub .The app uses Offline First design principles by storing recordings locally, in the web browser, and then using Hoodie’s Apache CouchDB-style data replication to synchronize changes to the backend services, where the analysis happens. 
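The Voice of InterConnect app does its syncing with Hoodie in the browser; purely as a rough server-side illustration of CouchDB-style replication, the sketch below asks a Cloudant (or CouchDB) instance to replicate one database into another through the standard _replicate endpoint. The host, credentials, and database names are placeholders.

```python
# Rough illustration only: trigger CouchDB-style replication between two
# databases on a Cloudant/CouchDB host via the standard _replicate endpoint.
# Host, credentials, and database names are placeholders.
import requests

COUCH_URL = 'https://ACCOUNT.cloudant.com'
AUTH = ('API_KEY', 'API_PASSWORD')

def replicate(source_db, target_db, continuous=False):
    body = {'source': source_db, 'target': target_db,
            'continuous': continuous, 'create_target': True}
    resp = requests.post(COUCH_URL + '/_replicate', json=body, auth=AUTH, timeout=60)
    resp.raise_for_status()
    return resp.json()

# e.g. replicate('recordings', 'recordings_for_analysis')
```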
IBM’s developerWorks TV and The New Builders Podcast did a recent interview on Voice of InterConnect, with the partners building the app: Steve Trevathan of Make&Model and Gregor Martynus of Neighbourhoodie . Here’s the video: MEETUP Speaking of The New Builders folks, they’re hosting a meetup this Sunday at Rí Rá Irish Pub at Mandalay Place , 7:00 p.m. — 10:30 p.m. This event is specifically for developers and data science folks, and will offer lightning talks on chatbots, Offline First development, and the PixieDust helper library for interactive notebooks. The event is free, but you’ll need to register here: The New Builders: Ideas on Tap Event You're invited to join the developer community at RiRa Irish Pub to network, learn & share! Bring a friend!Join us for… www.eventbrite.comPRESENTATIONS Members of our team will be presenting talks and leading Ask Me Anything sessions (AMA) and drop-in labs at InterConnect. AMAs and labs run on a drop-in basis during the times below. Labs take about 20 minutes to complete, and we’ll be there to help. For AMAs, you’ll lead the conversation with your questions, or you can ask for demos. Here’s an overview: VISUALIZING BIG DATA WITH MAPS — AMA WITH RAJ SINGH Ask Raj how to use map-based visualizations to sanity-check your big-data analyses. Tuesday, 3 p.m. — 5 p.m., DevZone AMA # 3 USING NOTEBOOKS WITH PIXIEDUST FOR FASTER, EASIER DATA ANALYSIS — LAB WITH VA BARBOSA AND DAVID TAIEB Explore data sets with PixieDust, an awesome helper library for data science notebooks on Spark, with help from Va and David. Wednesday, 1:15 p.m. — 5 p.m., DevZone Hello World Lab # 4 MOBILE MAPPING WITH THE WATSON DATA PLATFORM — LAB WITH RAJ SINGH Learn to use location data and maps in your mobile apps, with help from Raj. Wednesday, 1:15 p.m. — 5 p.m., DevZone Hello World Lab # 1 CHATBOT ARCHITECTURE, DESIGN AND DEVELOPMENT — AMA WITH MARK WATSON Ask Mark about chatbot architecture. He’ll have some example apps to share too. Wednesday, 2:30 p.m — 5 p.m., DevZone Ask Me Anything # 1 FROM MOBILE FIRST TO OFFLINE FIRST — BREAKOUT SESSION WITH BRADLEY HOLT Bradley will show how you can build fast, responsive apps that will keep users happy, even without a reliable network connection. Thursday, 9:30 a.m. — 10:15 a.m., Islander H FIND US THERE Most of the Watson Data Platform dev advos will be in the DevZone at InterConnect. Where’s the DevZone? It’s in the back of the conference’s concourse (a.k.a. expo area). What does that look like? This!: Get in the zone—the IBM InterConnect DevZone. We don’t want to be starring in “Zone Alone,” after all. LOL, ok, ok.See you in Las Vegas! Thanks to Bradley Holt . * Ibm Watson * Cloudant * Offline First * Apache Spark Blocked Unblock Follow FollowingMIKE BROBERG Editor for the IBM Watson Data Platform developer advocacy team. OK person. FollowIBM WATSON DATA LAB The things you can make with data, on the IBM Watson Data Platform. * Share * * * * Never miss a story from IBM Watson Data Lab , when you sign up for Medium. Learn more Never miss a story from IBM Watson Data Lab Get updates Get updates","If you’re an IBM customer or business partner, you’ve probably heard of the company’s InterConnect conference. 
If you’re unfamiliar, IBM InterConnect is a huge conference at the Mandalay Bay in Las…",InterConnect with us,Live,65 171,"Skip to main content IBM developerWorks / Developer Centers Sign In | Register Cloud Data Services * Services * How-Tos * Blog * Events * ConnectINTRODUCING CLOUDANT FOODTRACKER: AN OFFLINE-FIRST APPBradley Holt / November 10, 2015I love helping people understand the “why” and the “how” of buildingoffline-first apps. An offline-first app is an app that works, without error,when it has no network connection. An offline-first app then applies progressive enhancement to enable additional features and functionality, such as syncing with a clouddatabase, when and if it has a reliable network connection. I’m happy tointroduce to you a new sample app called Cloudant FoodTracker which demonstrates building an offline-first app using Cloudant Sync for iOS ( we just released Cloudant Sync for iOS v1.0 ).Apple provides a great tutorial on starting to develop iOS apps . The tutorial walks readers through creating a simple meal tracking app calledFoodTracker. From the tutorial:“This app shows a list of meals, including a meal name, rating, and photo. Auser can add a new meal, and remove or edit an existing meal. To add a new mealor edit an existing one, users navigate to a different screen where they canspecify a name, rating, and photo for a particular meal.”In a strict sense of the term, Apple’s FoodTracker can be considered anoffline-first app. As Apple’s FoodTracker has no network capabilities, it mightbe better to call it an offline-only app. All of your meal data is stored locally on the device–and it never leavesthe device. Soon we will publish a tutorial that walks you through transforming Apple’sFoodTracker into a true offline-first app that stores its data locally using Cloudant Sync for iOS , and then synchronizes this data with IBM Cloudant. For those of you who want an early preview, we’ve published the Cloudant FoodTracker code on GitHub .SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Tagged: cloudant / Cloudant Sync / FoodTracker / iOS / Offline First Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM PrivacyIBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! 
Email check failed, please try again Sorry, your blog cannot share posts by email.",I'm happy to introduce to you a new sample app called Cloudant FoodTracker which demonstrates building an offline-first app using Cloudant Sync for iOS.,Introducing Cloudant FoodTracker: An Offline-First App,Live,66 172,"LORNAJANE BLOG 09 Jun 2016FIND MONGO DOCUMENT BY ID USING THE PHP LIBRARY My new job as a Developer Advocate with IBM means I get to play with databases for a living (this is the most awesome thing ever invented, seriously). On my travels, I spent some time with MongoDB which is a document database - but I ran into an issue with fetching a record by ID so here's the code I eventually arrived at, so I can refer to it later and if anyone else needs it hopefully they will find it too. MONGODB AND IDS When I inserted the data to the collection, I did not set the _id field; if this is empty, MongoDB will just generate an ID and use that which is fine by me. My issues arose when I wanted to then fetch that data using that generated identifier. If I inspect my data by using db.posts.find() (my collection is called posts), then the data looks like this: {""_id"":ObjectId(""575038831661d710f04111c1""),... So if I want to fetch by ID, I need to include that ObjectId function call around the ID. USING THE PHP LIBRARY When I came to do this with PHP, I couldn't find an example of using the new MongoDB PHP Library that used the ID in this way (but it's a good library, use it). Older versions of this library used a class called MongoID and I knew that wasn't what I wanted - but had I checked the docs for that, I'd have found that they have been updated to point to the new equivalent so this is also very useful to know if you can only find older code examples! To pass an ID to MongoDB using the PHP Library, you will need to construct a MongoDB\BSON\ObjectID . My example was blog posts and to fetch a record by its ID, I used: $post=$posts->findOne([""_id""=>newMongoDB\BSON\ObjectID($id)]); Later I updated the record - the blog post included nested comments in the record, so to add an array to the comments collection of a record whose _id I knew, I used this code: $result=$posts->updateOne([""_id""=>newMongoDB\BSON\ObjectID($id)],['$push'=>[""comments""=>$new_comment_data]]); Hopefully this gives you a pointer on using the generated IDs in MongoDB from the PHP library and saves you at least as much time as I lost trying to figure this out! FURTHER READING * Importing and Exporting MongoDB Databases * XHGui on VM, Storage on Host * MySQL 5.7 Introduces a JSON Data Type This entry was posted in php and tagged mongodb by lornajane . Bookmark the permalink .POST NAVIGATION ← Previous Next →ONE THOUGHT ON “ FIND MONGO DOCUMENT BY ID USING THE PHP LIBRARY ” 1. Pingback: Community News: Recent posts from PHP Quickfix (06.15.2016) – SourceCode 2. LEAVE A REPLY CANCEL REPLY Please use [code] and [/code] around any source code you wish to share.Comment Name * Email * Website CONTACT * Email: [email protected] * Twitter: @lornajane * Phone: +44 113 830 1739 LINKS * Go PHP7 (ext) * Joind.In * ZCE Links Bundle * ZCE Questions Pack BOOKS AND VIDEOS © 2006-2016 LornaJane.net Icons courtesy of The Noun Project","My new job as a Developer Advocate with IBM means I get to play with databases for a living (this is the most awesome thing ever invented, seriously). 
On my travels, I spent some time with MongoDB …",Find Mongo Document By ID Using The PHP Library,Live,67 173,"KDNUGGETS Data Mining, Analytics, Big Data, and Data Science Subscribe to KDnuggets News | Follow | Contact * SOFTWARE * NEWS * Top stories * Opinions * Tutorials * JOBS * Academic * Companies * Courses * Datasets * EDUCATION * Certificates * Meetings * Webinars KDnuggets Home » News » 2016 » Jun » Tutorials, Overviews » An Introduction to Scientific Python (and a Bit of the Maths Behind It) – NumPy ( 16:n20 )LATEST NEWS, STORIES * In Deep Learning, Architecture Engineering is the New ... What the Next Generation of IoT Sensors Have in Store MNIST Generative Adversarial Model in Keras Online Master of Science in Predictive Analytics Statistical Data Analysis in Python More News & Stories | Top Stories AN INTRODUCTION TO SCIENTIFIC PYTHON (AND A BIT OF THE MATHS BEHIND IT) – NUMPY Previous post Next post Tweet Tags: numpy , Python , Scientific Computing -------------------------------------------------------------------------------- An introductory overview of NumPy, one of the foundational aspects of Scientific Computing in Python, along with some explanation of the maths involved. By Jamal Moir, Oxford Brookes University . Oh the amazing things you can do with Numpy. NumPy is a blazing fast maths library for Python with a heavy emphasis on arrays. It allows you to do vector and matrix maths within Python and as a lot of the underlying functions are actually written in C, you get speeds that you would never reach in vanilla Python. Numpy is an absolutely key piece to the success of scientific Python and if you want to get into Data Science and or Machine Learning in Python, it's a must learn. NumPy is well built in my opinion and getting started with it is not difficult at all. This is the second post in a series of posts on scientific Python, don't forget to check out the others too. An up-to-date list of posts in this series is at the bottom of this post. ARRAY BASICS Creation NumPy revolves around these things called arrays. Actually nparrays, but we don't need to worry about that. With these arrays we can do all sorts of useful things like vector and matrix maths at lightning speeds. Get your linear algebra on! (Just kidding we won't be doing any heavy maths) # 1D Array a = np.array([0, 1, 2, 3, 4]) b = np.array((0, 1, 2, 3, 4)) c = np.arange(5) d = np.linspace(0, 2*np.pi, 5) print(a) # [0 1 2 3 4]print(b) # [0 1 2 3 4]print(c) # [0 1 2 3 4]print(d) # [ 0. 1.57079633 3.14159265 4.71238898 6.28318531]print(a[3]) # 3 The above code shows 4 different ways of creating an array. The most basic way is just passing a sequence to NumPy's array() function; you can pass it any sequence, not just lists like you usually see. Notice how when we print an array with numbers of different length, it automatically pads them out. This is useful for viewing matrices. Indexing on arrays works just like that of a list or any other of Python's sequences. You can also use slicing on them, I won't go into slicing a 1D array here, if you want more information on slicing, check out this post . The above array example is how you can represent a vector with NumPy, next we will take a look at how we can represent matrices and more with multidimensional arrays. # MD Array, a = np.array([[11, 12, 13, 14, 15], [16, 17, 18, 19, 20], [21, 22, 23, 24, 25], [26, 27, 28 ,29, 30], [31, 32, 33, 34, 35]]) print(a[2,4]) # 25 To create a 2D array we pass the array() function a list of lists (or a sequence of sequences). 
If we wanted a 3D array we would pass it a list of lists of lists, a 4D array would be a list of lists of lists of lists and so on. Notice how with a 2D array (with the help of our friend the space bar), is arranged in rows and columns. To index a 2D array we simply reference a row and a column. A Bit of the Maths Behind It To understand this properly, we should really take a look at what vectors and matrices are. A vector is a quantity that has both direction and magnitude. They are often used to represent things such as velocity, acceleration and momentum. Vectors can be written in a number of ways although the one which will be most useful to us is the form where they are written as an n-tuple such as (1, 4, 6, 9). This is how we represent them in NumPy. A matrix is similar to a vector, except it is made up of rows and columns; much like a grid. The values within the matrix can be referenced by giving the row and the column that it resides in. In NumPy we make arrays by passing a sequence of sequences as we did previously. Multidimensional Array Slicing Slicing a multidimensional array is a bit more complicated than a 1D one and it's something that you will do a lot while using NumPy. # MD slicingprint(a[0, 1:4]) # [12 13 14]print(a[1:4, 0]) # [16 21 26]print(a[::2,::2]) # [[11 13 15]# [21 23 25]# [31 33 35]]print(a[:, 1]) # [12 17 22 27 32] As you can see you slice a multidimensional array by doing a separate slice for each dimension separated with commas. So with a 2D array our first slice defines the slicing for rows and our second slice defines the slicing for columns. Notice that you can simply specify a row or a column by entering the number. The first example above selects the 0th column from the array. The diagram below illustrates what the given example slices do. Array Properties When working with NumPy you might want to know certain things about your arrays. Luckily there are lots of handy methods included within the package to give you the information that you need. # Array properties a = np.array([[11, 12, 13, 14, 15], [16, 17, 18, 19, 20], [21, 22, 23, 24, 25], [26, 27, 28 ,29, 30], [31, 32, 33, 34, 35]]) print(type(a)) # = ad_1.date - interval '29 days' AND ad_2.date � Using this method we can achieve the same results as described above with the window frame. If you're operating over large amounts of data, the window frame option is going to be more efficient, but this alternative exists if you want to use it. CALCULATING A CUMULATIVE MOVING AVERAGE Now that we've reviewed a couple methods for how to calculate a simple moving average, we'll switch up our window frame example to show how you can also do a cumulative moving average. The same principles apply, but rather than having a continually shifting window frame for an interval, the window frame simply extends. For example, instead of doing a 30 day rolling average, we're going to calculate a year-to-date moving average. For each new date, it's value is simply included in the average calculation from all the previous dates. Let's have a look at this example: SELECT ad.date, AVG(ad.downloads) OVER(ORDER BY ad.date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS avg_downloads_ytd FROM app_downloads_by_date ad ; Because our base table starts at January 1st for the current year, we're using UNBOUNDED PRECEDING to set our window frame. The results we get back for this cumulative calculation look like this: date | avg_downloads_ytd ----------------------------------- . . . . 
2016-05-26 | 20.2585034013605442 2016-05-27 | 20.3243243243243243 2016-05-28 | 20.2348993288590604 2016-05-29 | 20.1933333333333333 2016-05-30 | 20.2052980132450331 2016-05-31 | 20.2039473684210526 2016-06-01 | 20.2287581699346405 2016-06-02 | 20.2727272727272727 2016-06-03 | 20.2967741935483871 2016-06-04 | 20.3910256410256410 2016-06-05 | 20.3885350318471338 2016-06-06 | 20.3924050632911392 2016-06-07 | 20.4465408805031447 2016-06-08 | 20.4812500000000000 2016-06-09 | 20.4968944099378882 2016-06-10 | 20.4938271604938272 2016-06-11 | 20.4478527607361963 2016-06-12 | 20.3719512195121951 2016-06-13 | 20.3454545454545455 2016-06-14 | 20.3734939759036145 2016-06-15 | 20.3772455089820359 2016-06-16 | 20.4583333333333333 2016-06-17 | 20.4260355029585799 2016-06-18 | 20.3941176470588235 2016-06-19 | 20.3625730994152047 2016-06-20 | 20.3953488372093023 2016-06-21 | 20.4277456647398844 2016-06-22 | 20.4080459770114943 2016-06-23 | 20.4342857142857143 2016-06-24 | 20.4090909090909091 2016-06-25 | 20.3672316384180791 2016-06-26 | 20.3314606741573034 2016-06-27 | 20.3128491620111732 2016-06-28 | 20.3166666666666667 2016-06-29 | 20.3480662983425414 2016-06-30 | 20.4120879120879121 2016-07-01 | 20.4426229508196721 2016-07-02 | 20.4184782608695652 If we chart these results, you can see that the advantage of the cumulative moving average is a further smoothing out of the data so that only significant data changes show up as trends. We see now that there is a slight upward trend year-to-date: WRAPPING UP Now that you know a couple different kinds of moving averages you can use and a couple different methods for calculating them, you can perform more insightful analysis and create more effective reports. In our next Metrics Maven article, we'll look at some options for how to make data pretty so that instead of values like ""20.4184782608695652"", we'll see ""20.42"". See you next time! Image by: extrabrandt Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Lisa Smith - keepin' it simple. Love this article? Head over to Lisa Smith’s author page and keep reading. Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Database as a Service Customer Stories Support System Status Support Enhanced Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ Enterprise Add-ons Deployments AWS DigitalOcean SoftLayer© 2016 Compose","Figuring out how to calculate a moving average can be a bit daunting if you've never done it. Once you learn a method you like, though, (we'll cover two) it's easy to do and you'll find many uses for it in your tracking and reports.",Metrics Maven: Calculating a Moving Average in PostgreSQL,Live,69 176,"Skip navigation Upload Sign in SearchLoading...Close Yeah, keep it Undo CloseTHIS VIDEO IS UNAVAILABLE.WATCH QUEUEQUEUEWatch Queue Queue * Remove all * Disconnect 1. Loading...Watch Queue Queue __count__/__total__ Find out why CloseOFFLINE-FIRST APPS WITH POUCHDBnode.js Subscribe Subscribed Unsubscribe 3,495 3KLoading...Loading...Working...Add toWANT TO WATCH THIS AGAIN LATER?Sign in to add this video to a playlist. Sign in Share More * ReportNEED TO REPORT THE VIDEO? Sign in to report inappropriate content. Sign in * Transcript * Statistics2,108 24LIKE THIS VIDEO?Sign in to make your opinion count. Sign in 25 0DON'T LIKE THIS VIDEO?Sign in to make your opinion count. 
Published on Dec 11, 2015. Bradley Holt, IBM Cloudant. Web and mobile apps shouldn't stop working when there's no network connection. Based on Apache CouchDB, PouchDB is an open source syncing JavaScript database that runs within a web browser. Offline-first apps that use PouchDB can provide a better, faster user experience—both offline and online. Learn how to build offline-enabled responsive mobile web apps using the HTML5 Offline Application Cache and PouchDB. We'll also discuss how to build cross-platform apps or high-fidelity prototypes using PouchDB, Cordova, and Ionic. PouchDB can also be run within Node.js and on devices for Internet of Things (IoT) applications. This talk includes code examples for creating a PouchDB database, creating a new document, updating a document, deleting a document, querying a database, synchronizing PouchDB with a remote database, and live updates to a user interface based on database changes. Category: People & Blogs. License: Standard YouTube License.
* About * Press * Copyright * Creators * Advertise * Developers * +YouTube * Terms * Privacy * Policy & Safety * Send feedback * Try something new! * Loading...Working...Sign in to add this to Watch LaterADD TOLoading playlists...","Bradley Holt, IBM Cloudant Web and mobile apps shouldn't stop working when there's no network connection. Based on Apache CouchDB, PouchDB is an open source ...",Offline-First Apps with PouchDB,Live,70 177,"Skip to main content IBM developerWorks / Developer Centers Sign In | Register Cloud Data Services * Services * How-Tos * Blog * Events * ConnectSIMPLE METRICS TUTORIAL PART 1: METRICS COLLECTIONRaj R Singh / August 25, 2015OVERVIEWThis tutorial explains how we created a lightweight web-tracking app to recorduser actions on our site’s search engine page. See how we use the open source Piwik® web analytics app to collect information and Node.js® to store that data in Cloudant . Then try it yourself by implementing tracking on a demo app we provide. Herein Part 1, we focus on data collection. When you’re done, you can try Part 2,where we show how to visualize the data you’ve gathered.WHY WE BUILT THIS APPWe had a problem here in the Cloud Data Services Developer Advocacy group. GlynnBird created a great faceted search engine that we use on our site’s How-To’s page ( read Glynn’s tutorial on creating your own faceted search engine ). Our How-To’s page is more sophisticated than a static web page. It uses AJAXto respond to user requests. Instead of refreshing the entire page, we updatesmall parts of the page to show results. This meant that traditional server-sidetracking tools that log events wouldn’t help us understand what users are doingin this dynamic context. We also ruled out available client-side trackingservices, because they don’t offer full control over what you track, or how datais stored and analyzed. How could we collect and see the user activity data wewanted? The answer was to create our own app to collect and analyze metrics. Weattached link-tracking to the UI elements dynamically generated by our site’sDOM, and we persisted that data to prepare for future analysis.Metrics appGET DEPLOYEDYou can preview the demo app to see how it works. But first things first. Here in Part 1, we’ll explain howthis app collects metrics. You can find all the code for Part 1 of this tutorialin the metrics-collector GitHub repo . The easiest way to explore the app is to deploy it to Bluemix (IBM’s open cloud platform for building, running, and managing applications).Open the repo’s README and click the Deploy to Bluemix button. When you click it, Bluemix creates and hosts a copy of the coderepository. Thanks, Deploy to Bluemix button!HOW IT WORKSHere’s an architectural overview of our metrics collector. Its middlewarecomponent serves tracker.js and piwik.js , which perform the metrics collection work and persist metrics data to thedatabase. We use Cloudant as our database, a NoSQL JSON document store based on Apache CouchDB™ .Metrics collector architectureTRACKING USER ACTIONS WITH PIWIKWe use the Piwik library to capture search events generated in our web page.Piwik’s JavaScript tracking client offers the ability to capture a host ofclient-side information, from basics like page views and outbound link clicks,down to the most detailed user events. 
For us, it captures the search activityby listening to events on the the user interface elements that create a request:the search text box, and the checkboxes for filtering search results.How-Tos search elementsTo connect Piwik to the web page you want to track, all you do is add one simpleline to that page. If you view the source of our How-Tos page , you’ll find the script tag include that reads:That’s all we do in the HTML page we’re tracking—load the tracker.js script andpass it a single variable, siteid , which is a unique identifier that’s saved to the database with every eventcoming from the How-Tos page. Tip: You can use this tracking app for any type of web page or app. But, you’re notlimited to just one at a time. The “application” identifier is the siteid, so ifyou use the same siteid on different web pages, their metrics are grouped andanalyzed together. (You can still identify the different web pages via the trackPageView Piwik event you’re tracking, and see it in the database as the url key). To see how event collection works, go to the metrics collector app’s repo , open the js folder and look at the tracker.js file. Two interesting functions are customDataFn , which captures metadata about a user’s browser, and enableLinkTrackingForNode , which facilitates link-tracking for a DOM node and lets us programmaticallyattach tracking to individual UI elements as they appear. You can find this line of code in the file cds.js in the search engine GitHub repo . The point of this client-side event tracking is that every user action on thesearch engine interface results in an event submission back to the tracker thatlooks something like this:TRACKING PAYLOAD URL SUBMISSIONhttps://metrics-collector.mybluemix.net/tracker? search=&search_cat=[{""key"":""topic"",""value"":""Data Warehousing""}, {""key"":""topic"",""value"":""Analytics""}]& search_count=7& idsite=cds.search.engine& rec=1&r=493261&h=17&m=46&s=48& url=https://developer.ibm.com/clouddataservices/how-tos/& _id=0e9dcf4b6b5b0dc7& _idts=1433860426& _idvc=2& _idn=0& _refts=0& _viewts=1433881201& _ref=https://google.com& send_image=0& pdf=1&qt=0&realp=0&wma=0&dir=0&fla=1&java=1&gears=0&ag=0& cookie=1&res=3360x2100>_ms=51& uap=MacIntel� rv:31.0) Gecko/20100101 Firefox/31.0& date=2015-5-4Pretty cool so far. We’ve implemented some custom event tracking on our searchengine web app. Next, we persist the data so we can do some usage analytics.PERSISTING USAGE DATA TO CLOUDANTWe’re going to use the Cloudant NoSQL database to store our event data. We do sofor a couple reasons: * Flexibility. Cloudant stores its data as JSON documents. That format provides schema flexibility that’s a nice fit for the event data. * Availability. Cloudant provides high availability read-write access, enabling high levels of concurrent connections, which ensures we never miss user interactions even under heavy load.To take that tracking payload and persist it to a Cloudant database, we wrote alittle Node.js Express app, server.js , which you’ll find in the metrics collector repo . This app accepts the data in an HTTP GET key-value-pair request, transformsit into JSON, and writes it to Cloudant. 
Here’s a sample JSON document showinghow a record is stored in Cloudant:STRUCTURE OF A TRACKING PAYLOAD DOCUMENT { ""type"": ""search"", //Type of event being captured (currently pageView, search and link) ""idsite"": ""cds.search.engine"", //app id (must be unique) ""ip"": ""75.126.70.43"", //ip of the client ""url"": ""https://developer.ibm.com/clouddataservices/how-tos/"", //source url _for_ the event ""geo"": { //geo coordinates of the client (if available) ""lat"": 42.3596328, ""long"": -71.0535177 } ""search"": """", //Search text if any (specific to search events) ""search_cat"": [ //Faceted search info (specific to search events) { ""key"": ""topic"", ""value"": ""Analytics"" }, { ""key"": ""topic"", ""value"": ""Data Warehousing"" } ], ""search_count"": 7, //search result count (specific to search events) ""action_name"": ""IBM Cloud Data Services - Developers Center - Products"", //Document title (specific to pageView events) ""link"": ""https://developer.ibm.com/bluemix/2015/04/29/connecting-pouchdb-cloudant-ibm-bluemix/"", //_target url_ (specific to link events) ""rec"": 1, //always 1 ""r"": 297222, //random string ""date"": ""2015-5-4"", //event date time -yyyy-mm-dd ""h"": 16, //event timestamp - hour ""m"": 20, //event timestamp - minute ""s"": 10, //event timestamp - seconds ""$_id"": ""0e9dcf4b6b5b0dc7"", //cookie visitor ""$_idts"": 1433860426, //cookie visitor count ""$_idvc"": 2, //Number of visits in the session ""$_idn"": 0, //Whether a new visitor or not ""$_refts"": 0, //Referral timestamp ""$_viewts"": 1433881201, //Last Visit timestamp ""$_ref"": 'google.com',//Referral url ""send_image"": 0, //used image to send payload ""uap"": ""MacIntel"", //client platform ""uab"": ""Netscape"", //client browser ""pdf"": 1, //browser feature: supports pdf ""qt"": 0, //browser feature: supports quickTime ""realp"": 0, //browser feature: supports real player ""wma"": 0, //browser feature: supports windows media player ""dir"": 0, //browser feature: supports director ""fla"": 1, //browser feature: supports shockwave ""java"": 1, //browser feature: supports java ""gears"": 0, //browser feature: supports google gear ""ag"": 0, //browser feature: supports silver light ""cookie"": 1, //browser feature: has cookies ""res"": ""3360x2100"", //browser feature: screen resolution ""gt_ms"": 51 //Config generation performance generation time }Let’s look at server.js . First, we load in required modules, including one called cloudant (loaded from the file storage.js ) that simplifies the process of connecting to a Cloudant database—much thesame way the excellent nano library simplifies connecting to an Apache CouchDB database. (Cloudant is, in manyways, an extension of CouchDB.) We set up our database connection in the trackerDb variable initialization and add some secondary indices to it at the same time.(In Cloudant and in CouchDB, secondary indices are defined by JavaScript Mapfunctions.) Then, we set up Express to serve the static JavaScript files. Thefollowing code around line 66 makes any file in the js directory web-accessible via the url http://metrics-collector.mybluemix.net/ :app.use(express.static(path.join(__dirname, 'js')));Last but not least, the app accepts event-tracking data on the /tracker endpoint. In app.get(""/tracker""... we take the data and use lodash to construct the JavaScript “tracking payload” object shown earlier. You may have noticed that our Node.js Express app is doing double duty. 
Notonly does it accept requests to save tracking information for persisting toCloudant, that same app serves out the JavaScript files, tracker.js and piwik.js .IMPLEMENT TRACKING ON A SAMPLE APPNow, try it for yourself. Test your deployment and implement tracking on a asample web app.CLONE THE SAMPLE APPLICATIONFor this test, we’ll use the guitars faceted search engine app written by Glynn Bird. 1. Copy the app to your local machine. git clone https://github.com/glynnbird/guitars 2. Add the following tracking script tag to index.html : 3. Edit guitar.js to add the tracking code for dynamically generated content. Locate the following code around line 140 and match what you see here: $('#searchtitle').html(html); //Reset the tracking for these elements if ( typeof _paq !== 'undefined' ){ _paq.push([ enableLinkTrackingForNode, $('#searchtitle')]); } Then around line 52: $.ajax(obj).done(function(data) { $('#loading').hide(); if (callback) { callback(null, data); } //Track the search results, do not log the initial page load as a search if ( searchText !== """" || (filter && $.isArray(filter) && filter.length � } } VERIFY THAT THE EVENTS ARE BEING RECORDED 1. Go to Bluemix and locate your metrics-collector application. 2. Click metrics-collector-cloudant-service . 3. Click the Launch button. 4. Click the tracker_db database and note the number of docs in the database. 5. In your favorite browser, launch the guitars index.html . 6. Search for some guitars and click on a few filters. 7. Go back to the Cloudant dashboard and reload the page. You’ll see that the number of docs has increased.You’ve now verified that the metrics collector application is correctly deployedon Bluemix and gathering data. In Part 2 of this tutorial, you’ll see how torepresent that data graphically in a report.SUMMARY OF METRICS COLLECTIONHere in Part 1 of this tutorial, you learned how to use Piwik to collect useractions and persist the data to a Cloudant database. Now you’re ready for Part2, Metrics Analytics , where you’ll learn how to display that data graphically in a report. Like Simple Metrics Collector?© “Apache”, “CouchDB”, “Apache CouchDB” and the CouchDB logo are trademarks orregistered trademarks of The Apache Software Foundation. All other brands andtrademarks are the property of their respective owners.SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM PrivacyIBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.","Tutorial for creating a web-tracking app that works with dynamically-generated UI elements. Uses Node.js, Cloudant, and IBM Bluemix.",Simple Metrics Tutorial Part 1: Metrics Collection -- Code a web analytics app with Node.js and IBM Cloudant,Live,71 179,"COULD POSTGRESQL 9.5 BE YOUR NEXT JSON DATABASE?Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published Apr 13, 2016TL;DR: No, but that's not the right question.Just over a year ago we asked Is PostgreSQL Your Next JSON Database ... Now, with PostgreSQL 9.5 out, it's time to check if Betteridge's law still applies. 
So let's talk about JSONB support in PostgreSQL 9.5.For context, and for those of you who haven't been following, it's worth knowingthe history of JSON in PostgreSQL. If you're all up to speed already, just skip ahead to read about the new features. The JSON story begins with the arrival of JSONin PostgreSQL 9.2..JSON IN 9.2The original JSON data type that landed in PostgreSQL 9.2 was basically a textcolumn flagged as JSON data for processing through a parser. In 9.2 though, youcould turn rows and arrays in json and for everything else you have to dive intoone of the PL languages. Useful in some cases but ... more, lots more wasneeded. To illustrate, if we had JSON data like this:{ ""title"": ""The Shawshank Redemption"", ""num_votes"": 1566874, ""rating"": 9.3, ""year"": ""1994"", ""type"": ""feature"", ""can_rate"": true, ""tconst"": ""tt0111161"", ""image"": { ""url"": ""http://ia.media-imdb.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_.jpg"", ""width"": 933, ""height"": 1388 }}We could create a table like so:CREATE TABLE filmsjson ( id BIGSERIAL PRIMARY KEY, data JSON );` And insert data into it like so:compose=> INSERT INTO filmsjson (data) VALUES ('{ ""title"": ""The Shawshank Redemption"", ""num_votes"": 1566874, ""rating"": 9.3, ""year"": ""1994"", ""type"": ""feature"", ""can_rate"": true, ""tconst"": ""tt0111161"", ""image"": { ""url"": ""http://ia.media-imdb.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_.jpg"", ""width"": 933, ""height"": 1388 }}')INSERT 0 1 compose=�And apart from storing and retrieving the entire document, there was little wecould do with it. Notice that all the spaces and carriage returns have beenpreserved. That'll be important later...FAST FORWARD TO POSTGESQL 9.3.On the back of a new parser for JSON in PostgreSQL 9.3, operators appear toextract values from the JSON data type. Chief among them is -> which can, given an integer, extract a value from a JSON array or, given astring, member of an JSON data type and ->> which does the same but returns text. Building on this is #> and #>> which allow a path to be specified to the value to be extracted.With our previous example table, that meant we could now at least peer into theJSON and do a query like:compose=> select data-� ?column? ---------------------------- ""The Shawshank Redemption""(1 row)compose=> select data#� ?column? ---------- 933(1 row)Yes, the path is a list of keys working down through the JSON document. Don't becaught out thinking the curly braces represent JSON though - this is a textarray as a literal string which PostgreSQL interprets into a text[]. That meansthat query is equivelant to this:select data#� These were joined by a good set of functions but this was all still pretty limited. It didn't really allow for complexqueries, there was limited indexing on particular fields and only a few ways tocreate new JSON elements. But most importantly all that on the fly parsing of atext field wasn't efficient.CUT TO POSTGRESQL 9.4.PostgreSQL 9.4 is where JSONB arrived. JSONB is a binary encoded version of JSONwhich efficiently stores the keys and values of a JSON document. This means allthe space padding is gone and with it all the need to parse the JSON. The downside is that you can't have repeated keys at the same level and you generallylose all the formatted structure of the document. It's a sacrifice thats wellworth making because everything gets generally more efficient because there's noon the fly parsing. 
It does slow inserts down because it's there that theparsing actually gets done. To see the difference, let's create a JSONB tableand insert our example data into it:compose=� CREATE TABLE compose=�INSERT 0 1 compose=> select * from filmsjsonb id | data ----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 1 | {""type"": ""feature"", ""year"": ""1994"", ""image"": {""url"": ""http://ia.media-imdb.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_.jpg"", ""width"": 933, ""height"": 1388}, ""title"": ""The Shawshank Redemption"", ""rating"": 9.3, ""tconst"": ""tt0111161"", ""can_rate"": true, ""num_votes"": 1566874}(1 row)Yes, that is rather wide. All the spaces and returns from the JSON data havegone leaving one compact key/value list.Although they share many features, here's a fun fact: JSONB has no creationfunctions. In 9.4, the JSON data type got a bundle of extra creation functions: json_build_object() , json_build_array() and json_object() . Use those, or other creation functions, and cast to JSONB ( ::jsonb ) to get the JSONB version. It reflects the logic the PostgreSQL developershave applied - JSON for document fidelity and storage, JSONB for fast, efficientoperations. So while JSON and JSONB both have the -> , ->> , #> and #>> operators, only JSONB has the ""contains"" and ""exists"" operators @> , <@ , ? , ?| and ?& .Exists is a check for strings that match top-level keys in the JSONB data so wecan check there's a rating field in our example data like so:compose=> select data-� ?column? ---------------------------- ""The Shawshank Redemption""(1 row)But if we queried for the url key that's inside the image value, we'd failcompose=> select data-� ?column? ----------(0 rows)But we could test the image value, like so:compose=> select data->'title' from filmsjsonb where data-� ?column? ---------------------------- ""The Shawshank Redemption""(1 row)The ?| operator does the same thing but ""or"" matches the keys against an array ofstrings rather than just one string. The ?& operator does a similar thing but ""and"" matches so all the strings in the arraymust be matched.But exists operators just check for presence. With the '@ ' contains operator you can match keys, paths and values. But let's quicklyimport some more movies into the database first. Ok, now say we want all themovies from 1972, we can look for the records that contain ""year"":""1972"".compose=> select data->'title' from filmsjsonb where data @� ?column? ----------------- ""The Godfather"" ""Solaris""(2 rows)And we can look for particular values within objects:compose=> select data->'title' from filmsjsonb where data @� ?column? -------------------------------------- ""The Green Mile"" ""My Neighbor Totoro"" ""Nausicaä of the Valley of the Wind""(3 rows)9.4 also brought creating GIN indexes which cover all the fields in the JSONBdocuments for all JSON operations. 
It's also possible to create GIN indexes with json_path_ops set which gives smaller, faster indexes but only for use of the @> contains operator which is actually remarkably useful as many JSON operationson nested documents are about finding documents which contain particular values.That said, there's still plenty of scope for more comprehensive and capableindexing.So, 9.4 brought PostgreSQL up to the point where you could create, extract andindex JSON/JSONB. What was missing though was the ability to modify theJSON/JSONB data types. You still had to look at passing the JSON data to a PLv8or PLPerl script where it could be natively manipulated. So, things were closeto being a full service JSON document handling environment, but not quite.ENTER POSTGRESQL 9.5PostgreSQL 9.5's new JSON capabilities are all about modifying and manipulatingJSONB data. Apart from one, that is. The jsonb_pretty() function takes JSONB and makes it more readable so you go from:compose=� data ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- {""type"": ""feature"", ""year"": ""1994"", ""image"": {""url"": ""http://ia.media-imdb.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_.jpg"", ""width"": 933, ""height"": 1388}, ""title"": ""The Shawshank Redemption"", ""rating"": 9.3, ""tconst"": ""tt0111161"", ""can_rate"": true, ""num_votes"": 1566874}(1 row)To a much more digestable form...compose=� jsonb_pretty --------------------------------------------------------------------------------------------------------------- { + ""type"": ""feature"", + ""year"": ""1994"", + ""image"": { + ""url"": ""http://ia.media-imdb.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_.jpg"",+ ""width"": 933, + ""height"": 1388 + }, + ""title"": ""The Shawshank Redemption"", + ""rating"": 9.3, + ""tconst"": ""tt0111161"", + ""can_rate"": true, + ""num_votes"": 1566874 + }(1 row)Which is much more readable and going to pop up in any JSON related PostgreSQL9.5 examples. On to the operators....LET US DELETEThe simplest modifier is deletion. Just say what you want gone and make it goaway. For that, 9.5 introduces the - and #- operators. The - operator works like the -> operator except instead of returning a value from an array (if given an integeras a parameter) or object (if given a string), it deletes the value or key/valuepair. So, with our movie database, if we want to remove the rating field thenthis does the trick:compose=� UPDATE 250 The #- operator goes further, taking a path as a parameter. 
So say we wanted to removethe image's dimension properties:compose=� UPDATE 250 compose=� UPDATE 250 compose=� jsonb_pretty -------------------------------------------------------------------------------------------------------------- { + ""type"": ""feature"", + ""year"": ""1994"", + ""image"": { + ""url"": ""http://ia.media-imdb.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_.jpg""+ }, + ""title"": ""The Shawshank Redemption"", + ""tconst"": ""tt0111161"", + ""can_rate"": true, + ""num_votes"": 1566874 + }(1 row)We do two updates because the path specifier doesn't allow for optional keys butwe can get it down to one update by remembering that the set expression can beas complex as we need it.compose=� UPDATE 250 Although you can delete data from the database, remember that you can also justremove it from your output too:compose=� jsonb_pretty -------------------------------------------------------------------------------------------------------------- { + ""type"": ""feature"", + ""year"": ""1994"", + ""image"": { + ""url"": ""http://ia.media-imdb.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_.jpg""+ }, + .... CONCATENATIONThe operator for manipulations is the concatenation operator || . This tries to combine two JSONB objects into one. It works with the top levelkeys of both values only and when the same key is present on both sides, itresolves it by taking the right-hand operand's value. This means you can use itas an update mechanism too. Say, using out example data, we need to set the can_rate field to false, clear the num_votes field and add a new revote field set to true...compose=� UPDATE 250 compose=� jsonb_pretty --------------------------------------------------------------------------------------------------------------- { + ""type"": ""feature"", + ""year"": ""1994"", + ""image"": { + ""url"": ""http://ia.media-imdb.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_.jpg"",+ ""width"": 933, + ""height"": 1388 + }, + ""title"": ""The Shawshank Redemption"", + ""rating"": 9.3, + ""revote"": true, + ""tconst"": ""tt0111161"", + ""can_rate"": false, + ""num_votes"": 0 + }(1 row)This is a generally useful way to merge JSONB data types, for example in postprocessing. As an update method it leaves something to be desired. Updating asingle top level-field, it's a bit overkill. Updating a nested single field in adocument, then you have to dig your way down to the containing object and mergefrom there. If only there was a simple way to set a particular field...JSONB_SET FOR SUCCESSThe jsonb_set() function is designed for updating single fields wherever they are in the JSONdocument. Let's jump straight to an example:compose=� This will change the value of the image.width property to 1024. The argumentsfor jsonb_set() are simple; the first argument is a JSONB data type you want to modify, thesecond is a text array path and the third is a JSONB value to replace the valueat the end of that path. If the key/value pair at the end of the path doesn'texist, by default, jsonb_set() creates and sets it. To stop that behavior, add a fourth optional parameter(""create_missing"") and set it to false. If ""create_missing"" is true but othercomponents of the path don't exist then jsonb_set() won't try to create the entire path and will just fail. 
Say we wanted to add anew object to our image data about picture rights, we can simply add in the JSONdata for that new object:compose=� compose=� jsonb_pretty --------------------------------------------------------------------------------------------------------------- { + ""type"": ""feature"", + ""year"": ""1972"", + ""image"": { + ""url"": ""http://ia.media-imdb.com/images/M/MV5BMjEyMjcyNDI4MF5BMl5BanBnXkFtZTcwMDA5Mzg3OA@@._V1_.jpg"",+ ""width"": 1024, + ""height"": 500, + ""quality"": { + ""copyright"": ""company X"", + ""registered"": true + } + }, + ""title"": ""The Godfather"", + ""rating"": 9.2, + ""tconst"": ""tt0068646"", + ""can_rate"": true, + ""num_votes"": 1072605 + }(1 row)jsonb_set() is probably the most important addition in PostgreSQL 9.5's JSON functions. Itoffers the chance to change data in-place within JSONB data types. Do rememberthat where we've used simple values to set parameters is only for examples; youcould have PostgreSQL subqueries creating new values and co-ercing them intoJSONB subdocuments or arrays to create richer JSON documents.CONSIDER THISWhat this all leads to is an interesting position for PostgreSQL. PostgreSQL9.5's JSON enhancements mean that you could use PostgreSQL as a JSON database;it's fast and functional. Whether you'd want to is a different consideration.For example, the relatively accessible APIs or client libraries of many JSONdatabases are not there. In their place is a PostgreSQL specific dialect of SQLfor manipulating JSON which is used in tandem with the rest of the database'sSQL to exploit the full power of it. This means you still have to learn SQL, arequirement which, unfortunately, too many people use as their reason for usinga ""NoSQL"" database.You can use PostgreSQL to create rich, complex JSON/JSONB documents within thedatabase. But then if you are doing that, you may want to consider whether youare using PostgreSQL well. If the richness and complexity of those documentscomes from relating the documents to each other then the relational model isoften the better choice for data models that have intertwined data. Therelational model also has the advantage that it handles that requirement withoutlarge scale duplication within the actual data. It also has literally decades ofengineering expertise backing up design decisions and optimizations.What JSON support in PostgreSQL is about is removing the barriers to processingJSON data within an SQL based relational environment. The new 9.5 features takedown another barrier, adding just enough accessible, built-in and efficientfunctions and operators to manipulate JSONB documents.PostgreSQL 9.5 isn't your next JSON database, but it is a great relationaldatabase with a fully fledged JSON story. The JSON enhancements arrive alongsidenumerous other improvements in the relational side of the database, ""upsert"",skip locking and better table sampling to name a few.It may not be your next JSON database, but PostgreSQL could well be the nextdatabase you use to work with relational and JSON data side by side.Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Dj Walker-Morgan is Compose's resident Content Curator, and has been both a developer and writersince Apples came in II flavors and Commodores had Pets. Love this article? Headover to Dj Walker-Morgan’s author page and keep reading. 
","Just over a year ago we asked Is PostgreSQL Your Next JSON Database... Now, with PostgreSQL 9.5 out, it's time to check if Betteridge's law still applies. So let's talk about JSONB support in PostgreSQL 9.5.",Could PostgreSQL 9.5 be your next JSON database?,Live,72 180,"THIS WEEK IN DATA SCIENCE (SEPTEMBER 27, 2016) Posted on September 27, 2016 by cora Here’s this week’s news in Data Science and Big Data. Don’t forget to subscribe if you find this useful! INTERESTING DATA SCIENCE ARTICLES AND NEWS * How Open Data Is Making Our Cities More Efficient – A new collaboration between the EU and Japan is looking to support the development of smart cities with a cloud-based shared platform * Self-Driving Cars Gain Powerful Ally: The Government – Uber, the ride-hailing giant, began trials in Pittsburgh last week using driverless technology. The government’s new guidelines for autonomous driving will speed up the rollout of self-driving cars, experts said. * MIT aims to make sense of Twitter’s presidential debate firehose – Using machine learning, Electome researchers analyze public’s debate conversations * Like a gym membership, data has no value unless you use it – Data is like that gym. How much you use it, how well you exercise and apply it, and how far it reaches into your work life determine the value return from having it * Meet Trace Genomics, The “23andMe” Of Soil – For $199, farmers can understand their soil better which is key to keeping their crops healthy. * Researchers Use Wireless Signals to Recognize Emotions – System that uses reflected radio signals has potential applications for smart homes, offices and hospitals * Is Artificial Intelligence Permanently Inscrutable? – Despite new biology-like tools, some insist interpretation is impossible. * Watch: IBM Watson creates the first AI-made movie trailer – and it’s really eerie – Now IBM Watson has added yet another skill to its arsenal as it just learned how to make movie trailers. * Airbnb Shows How Private Sector Can Use Data to Fight Discrimination – Airbnb has acknowledged the bias present on its platform, noting that “minorities struggle more than others to book a listing,” and has created a plan to tackle discrimination on its platform. * What Math Looks Like in the Mind – In a surprise to scientists, it appears blind people process numbers by tapping into a part of their brains that’s reserved for images in sighted individuals.
* Top Algorithms and Methods Used by Data Scientists – Latest KDnuggets poll identifies the list of top algorithms actually used by Data Scientists, finds surprises including the most academic and most industry-oriented algorithms. * Why The Cars of the Future Will Rely on the IoT – The future of vehicles is exciting, and engineers are working toward safer, simpler, and faster modes of transportation all the time. * IBM Watson and The Weather Company Are Ready to Launch Their First Cognitive Ads – The Weather Company is getting ready to roll out its first ad campaign since being acquired by IBM earlier this year. But for the first brand, Campbell Soup Company, it’s featuring the supercomputer Watson as the chef. * 14 Traits Of The Best Data Scientists – Actual data scientists are in high demand, and there’s not enough of them to go around. If you want to identify the right talent, consider these tips. * How Big Data Changes the Economics of Renewable Energy – Big data can boost the transition to renewable energy sources much faster, says WSJ Energy Expert Jason Bordoff UPCOMING DATA SCIENCE EVENTS * Deriving value from the data lake – Join Nik Rouda, Senior Analyst for Enterprise Strategy, on October 6th, to learn more about data lakes. * Machine Intelligence Summit New York – Come hear from amazing speakers, discover emerging trends, and expand your network at the Machine Intelligence Summit on November 2nd-3rd. * IBM Webinar: Driving Innovation and Growth with Big Data – Join Noel Yuhanna, Principal Analyst at Forrester Research, on October 6th, to hear how an emerging collection of technologies that Forrester calls big data fabric is driving innovation and growth. NEW IN BIG DATA UNIVERSITY * Text Analytics – This course introduces the field of Information Extraction and how to use a specific system, SystemT, to solve your Information Extraction problem. * Advanced Text Analytics – This course goes into details about the SystemT optimizer and how it addresses the limitations of previous IE technologies.",Here’s this week’s news in Data Science and Big Data. ,"This Week in Data Science (September 27, 2016)",Live,73 181,"IBM ML Hub, Apr 26 THE 3 KINDS OF CONTEXT: MACHINE LEARNING AND THE ART OF THE FRAME “What do you do for a living?” That question used to have a pretty clear answer: “I’m a data scientist.” But lately, it’s gotten more complicated… The two of us — Jorge Castañón and Óscar D. Lara Yejas — do data science at IBM’s Machine Learning Hub , where clients from around the world bring us their mission-critical goals for turning data into knowledge.
The clients range across industries from insurance to retail to energy to finance. On Monday, a manufacturer needs to use data from the quality control process to fast-forward the testing of components — with no forfeit of excellence. On Wednesday, a healthcare provider needs to isolate the real factors that tip patients from low risk to high risk. On Friday, a credit union needs to improve its loyalty offerings for retirees. Most client visits last just two days. In those 48 hours, our job is to go from zero to insight. We set priorities, access and clean the data, fire up Jupyter notebooks, set up a collaboration environment with Data Science Experience (yes, this is a plug; it’s fantastic.), choose algorithms, build models, run and tweak the models, and generate visualizations and recommendations. But none of that is the complicated part… THE 3 KINDS OF CONTEXT The complicated part is about context. We’ve learned that hopping from project to project doesn’t just mean hopping from one context to another — it means hopping across multiple kinds of context: * Industry context * Data context * Transfer context The first two are fairly intuitive. The third is less so. Let’s take each of them in turn. INDUSTRY CONTEXT Well before we dive into data and models, we ask clients to convey their domain expertise. These are people with a seemingly limitless understanding of the industry issues at play — and of the dynamics that are shaping the demands of those they serve. The more we listen to these clients, the clearer it becomes that each industry (aka sector, aka vertical) represents a problem space unto itself, each with its own goals for data: Healthcare clients tend to want to solve classification problems. Finance and energy clients tend to want to solve certain kinds of prediction problems. And manufacturing, transportation, and insurance clients tend to want to solve optimization problems. Are those tendencies cut-and-dry? Absolutely not. But combined with careful listening, they give us a place to start. But then there are the limitations. Sometimes the limitation is about the client’s familiarity with machine learning itself. Healthcare and retail have been deep into machine learning for years, while other industries are just ramping up. (Interestingly, sometimes the less familiar the better, since some clients turn out to be sitting on troves of accumulated data — typically proprietary data behind the firewall that’s just waiting to be mined .) And sometimes the limitation is about the need for interpretability. The algorithms and models we choose vary from client to client based on whether the models need to “show their work”. Our healthcare client needed more than a numerical prediction of risk migration for a given patient; they needed to know the factors at play and the weight for each factor. By the same token, banks, insurers, and government bureaus need to be able to assure watchdogs and regulators that their ML-driven automations are bias-free. To preserve interpretability for those industries, we might try to favor methods like logistic regression and decision trees . Where interpretability is less important — for example, in retail — we can jump into deep learning and other black-box approaches. It’s only after we have our heads around that industry context that we start to puzzle through the actual data. DATA CONTEXT After cleaning and formatting the data we get from clients, we’re looking for what kinds of ML models the data is capable of driving. 
And let’s be frank: some clients approach us with real problems that just can’t be addressed with machine learning and the data at hand, so first we talk through what’s possible. Once we have something tractable in mind, we can start to ask more questions: What are the inputs and outputs? What’s the plan for feature extraction? Should we use supervised or unsupervised learning? (So far, it hasn’t made sense to use reinforcement learning , but maybe someday soon.) Is the response variable continuous or a class that you want to predict? If you need a classification model, which variables help to represent the classes we’ll use? And so on. That work gets us a list of potential models. But on top of all that, we also want some context about how data comes to the system in the real world. How much data? How often? As a stream or in batches? Not to mention questions about provenance, governance, and security. We seldom have enough time to go as deep as we want to with clients, but without some of that context, we might end up creating models that can’t actually be deployed, accessed, or retrained. So, the industry information and the data take us a long way toward framing our efforts, but there’s one more angle we didn’t anticipate. TRANSFER CONTEXT The more time we spend at the Machine Learning Hub, the more we’re struck by what we’re learning about learning. Naturally, we want to come fresh to every encounter — but we still want to benefit from all the work we’ve done before. As we think about those trade-offs, we’re realizing that our daily work as flesh-and-blood data scientists maps onto a key aspect of the search for artificial general intelligence (AGI): transfer learning. As the name suggests, transfer learning means trying to improve performance on a task by leveraging knowledge acquired from some related task. That’s something we do every day. How well we do over time will depend on how successfully we can discern the knowledge that we should — and shouldn’t — transfer from one engagement to another. In that sense, the third context is really about our roles as data scientists and being aware of that context means being aware of our opportunities for improving our methods across a wide range of problem spaces — while also thinking of ourselves as learning machines that are prone to cognitive biases . Who knows, maybe the processes we develop at the Machine Learning Hub will offer clues to achieving AGI . ART OF THE FRAME As much as thinking about these three contexts has helped us, it’s also reinforced the fact that machine learning is often more art than science. For us, it’s an art of emphasis and de-emphasis. It’s the art of finding frames to put around the world — whether that’s a frame around an industry, a frame around the data, or a frame around our own learning. Whatever the frame, our hope is to enlarge and energize the features that matter — and to see them with fresh eyes. For more about our work at the Machine Learning Hub or to schedule a session, reach out to us. We’d love to continue the conversation. * Machine Learning * Data Science * Industry * Data * Transfer Learning 5 Blocked Unblock Follow FollowingIBM ML HUB IBM Machine Learning Hub. To be the best, learn from the best. Latest on Machine Learning, AI & more. Info: MLHub@us.ibm.com http://ibm-ml-hub.com/ #IBMML #ML FollowINSIDE MACHINE LEARNING Deep-dive articles about machine learning and data. Curated by IBM Analytics. 
","“What do you do for a living?” That question used to have a pretty clear answer: “I’m a data scientist.” But lately, it’s gotten more complicated…",The 3 Kinds of Context: Machine Learning and the Art of the Frame,Live,74 182,"DATA SCIENCE EXPERIENCE: TOUR THE COMMUNITY SECTION developerWorks TV Published on Oct 3, 2017 Find more videos in the Data Science Experience Learning Center at http://ibm.biz/dsx-learning
","This video provides a tour of the Community section in IBM Data Science Experience. ",Tour the Community in DSX,Live,75 183,"THIS WEEK IN DATA SCIENCE (MAY 2, 2017) Posted on May 2, 2017 by Janice Darling Here’s this week’s news in Data Science and Big Data. Don’t forget to subscribe if you find this useful! INTERESTING DATA SCIENCE ARTICLES AND NEWS * Video Roundup: New from IBM Watson – A brief run down of some new things IBM Watson is tackling. * Five Missteps to Avoid on your First Big Data Journey. – Steps to take in order to avoid common Big Data pitfalls. * How Machine Learning Is Changing The Future Of Digital Businesses – How Machine Learning impacts automation and digital transformation. * Hacking maps with ggplot2 – Short look at mapping with the R package ggplot2. * AI & Machine Learning Black Boxes: The Need for Transparency and Accountability – The importance of comprehending the inner workings Machine Learning Algorithms. * How just 30 machines beat a warehouse-sized supercomputer to set a new world record – IBM partners with Nvidia to showcase the ability of massively parallel processing on GPUs. * Data Analytics Is The Key Skill For The Modern Engineer – How engineers can embrace Data Analytics to streamline business operations and task integration. * Building and Exploring a Map of Reddit with Python – A tutorial on how to explore a map of the most popular subreddits with python. * Data scientists really love their jobs, survey finds – The results of a survey showing how satisfied Data Scientists are with their jobs. * Reproducible Data Science with R – A presentation on the application of a Reproducible Workflow to Data Science in R. * IBM uses deep learning to better detect a leading cause of blindness – IBM has made another application of cognitive computing to the medical field. * Awesome Deep Learning: Most Cited Deep Learning Papers – A list of fairly recent must read publications on Deep Learning.
* Emotion Detection Using Machine Learning – An example of the use of Deep Learning to perform feature extractions. * Plotting Data Online via Plotly and Python – Introductory steps to creating plots with Plotly. * Machine Learning Classification Using Naive Bayes – A classification exercise using the Naive-Bayes algorithm in R. * The Art of Data – How Watson, fed with data about different subjects, helped to create art. FEATURED COURSES FROM BDU * SQL and Relational Databases 101 – Learn the basics of the database querying language, SQL. * Big Data 101 – What Is Big Data? Take Our Free Big Data Course to Find Out. * Predictive Modeling Fundamentals I – Take this free course and learn the different mathematical algorithms used to detect patterns hidden in data. * Using R with Databases – Learn how to unleash the power of R when working with relational databases in our newest free course. * Deep Learning with TensorFlow – Take this free TensorFlow course and learn how to use Google’s library to apply deep learning to different data types in order to solve real world problems. UPCOMING DATA SCIENCE EVENTS * UofT Data Science Workshop: Intro to Clustering with R – May 2, 2017 @ 6:00 pm – 9:00 pm * UofT Data Science Workshop: Intro to Classification with R – May 4, 2017 @ 6:00 pm – 7:00 pm * IBM Webinar: Charting Your Analytical Future Webinar: Get the best of Self-service Analytics and Managed reporting together – May 4, 2017 @ 12:00 pm – 1:00 pm COOL DATA SCIENCE VIDEOS * Machine Learning With Python – Collaborative Filtering & Its Challenges – An Exploration of Collaborative Filtering Techniques. * Machine Learning With Python – Course Summary – A review of the BDU course Machine Learning 101.","Here’s this week’s news in Data Science and Big Data. ","This Week in Data Science (May 2, 2017)",Live,76 184,"
","While the sum of Facebook's offerings covers a broad spectrum of the analytics space, we continually interact with the open source community in order to share our experiences and also learn from others.",Apache Spark @Scale: A 60 TB+ production use case,Live,77 186,"THIS WEEK IN DATA SCIENCE (MAY 16, 2017) Posted on May 16, 2017 by Janice Darling Here’s this week’s news in Data Science and Big Data. Don’t forget to subscribe if you find this useful! INTERESTING DATA SCIENCE ARTICLES AND NEWS * General Tips for Web Scraping with Python – Tips on scrapping and saving data from the web. * Top 10 Skills in Data Science – The results of a study on the skills possessed by Data Science. * Data Mining for Social Intelligence – Opinion Data as a Monetizable Resource – A look at how opinion data is quickly becoming a monetary resource. * Sparking change: How analytics is helping global communities improve water security – How Water Mission turned to IBM to use analytics to improve access to safe water. * How to go about interpreting regression coefficients – A brief look at coefficients and how to interpret them. * Three Mistakes that set Data Scientists up for Failure – Mistakes that Data Scientists may make in their line of work and how to avoid them. * Analytics and the cloud: The rise of open source – Open Source and IBM’s involvement in Open Source software. * Top 15 Python Libraries for Data Science in 2017 – A look at 15 of the most popular Python Data Science libraries. * IBM updates PowerAI to make deep learning more accessible – How IBM updates to PowerAI will make it easier for Data Scientists and developers to integrate and deploy models. * Big Data for Humans: The Importance of Data Visualization – The importance of the most crucial and oft overlooked step in Analytics: Data Visualization. * Top 3 ways to measure the success of your analytics investment – Three factors to consider when evaluating technologies that aid in business decisions. * Pretty histograms with ggplot2 – Learn to create visually stimulating histograms by example with ggplot2 for R. * IBM pushes for NVMe adoption to boost storage speeds – Why the adoption of NVMe is necessary for today’s vast amounts of data. * In case you missed it: April 2017 roundup – A look back at all the stories from Revolutions R blog. * Machine Learning Pipelines for R – How the R package pipeliner helps to streamline the process of building machine learning and statistical models.
* Machine Learning. Linear Regression Full Example (Boston Housing). – Short tutorial on performing linear regression on a data set. FEATURED COURSES FROM BDU * SQL and Relational Databases 101 – Learn the basics of the database querying language, SQL. * Big Data 101 – What Is Big Data? Take Our Free Big Data Course to Find Out. * Predictive Modeling Fundamentals I – Take this free course and learn the different mathematical algorithms used to detect patterns hidden in data. * Using R with Databases – Learn how to unleash the power of R when working with relational databases in our newest free course. * Deep Learning with TensorFlow – Take this free TensorFlow course and learn how to use Google’s library to apply deep learning to different data types in order to solve real world problems.","Here’s this week’s news in Data Science and Big Data.","This Week in Data Science (May 16, 2017)",Live,78 187,"This video shows you how a Cloudant Geospatial index is used in a real world application. Watch the other videos in this series titled ""Introducing Cloudant Geospatial"" and ""Build and Query a Cloudant Geospatial Index"". Find more videos in the Cloudant Learning Center at http://www.cloudant.com/learning-center.",See how a Cloudant Geospatial index is used in a real world application. ,Tutorial: Cloudant Geospatial in Action,Live,79 190,"Akhil Tandon, Jun 9 LEVERAGE SCIKIT-LEARN MODELS WITH CORE ML OVERVIEW This post discusses how to implement Apple's new Core ML platform within DSX, which was announced a few days ago at WWDC 2017. Core ML is a platform that allows integration of powerful pre-trained models into iOS and macOS applications. Core ML comes with two main benefits: efficiency and privacy. Core ML has been specifically engineered for on-device performance. Having a pre-trained model accessible on your device removes a network connection requirement and ensures privacy for users. But the best thing about Core ML is that you can continue to use your favorite machine learning libraries in Python, and easily convert your pre-trained models to Core ML objects for use in your iOS and macOS application development. The conversion to Core ML objects from libraries such as Keras , sklearn , LibSVM , and others is supported out-of-the-box in Data Science Experience. INSTALLATION You can install coremltools via pip, which can be called from within a notebook in DSx. It's important to note that Core ML supports Python 2.7 only.
!pip install -U coremltools

CREATE A LINEAR MODEL WITH SCIKIT-LEARN

First, create some data using numpy , a library for computing with Python. We'll create a very simple model because the focus of this short guide is converting a scikit-learn object to a Core ML model.

import numpy as np
x_values = np.linspace(-2.25,2.25,300)
y_values = np.array([np.sin(x) + np.random.randn()*.25 for x in x_values])

Now that we've got our data, we'll perform a linear regression.

from sklearn.linear_model import LinearRegression
lm = LinearRegression().fit(x_values.reshape(-1,1), y_values)

CREATE A CORE ML MODEL

Core ML supports many kinds of machine learning models in addition to linear models, including neural networks, tree-based models, and more. The Core ML model format supports the .mlmodel file extension. We'll show how to instantiate an MLModel using this kind of file. The aim is to painlessly transition from an sklearn object to a Core ML model.

from coremltools.converters import sklearn
coreml_model = sklearn.convert(lm)
print(type(coreml_model))

Now coreml_model is our Core ML object. The MLModel class has a few attributes and methods. Metadata contains information about the origin, author, inputs and outputs, among other things. Let's see how this works.

coreml_model.author = ""DSX""
print(coreml_model.author)
DSX

We can add other metadata as we please. The list of attributes includes: * author : The author of the model. * input_description : The descriptions of the inputs. This can include information about the data types, number of features, and more. In our example, we have a single input, a real valued number. * output_description : A description of the output. * short_description : A comment on the purpose of the model. * user_defined_metadata : Anything you like!

coreml_model.short_description = ""I approximate a sine curve with a linear model!""
coreml_model.input_description[""input""] = ""a real number""
coreml_model.output_description[""prediction""] = ""a real number""
print(coreml_model.short_description)
I approximate a sine curve with a linear model!

At this point you have a tuned and labeled CoreML object. The goal is to seamlessly integrate this into the existing workflow of an iOS/macOS application developer who needs your machine learning models. Saving the model to local storage is very easy using coremltools :

coreml_model.save('linear_model.mlmodel')

We can also create an MLModel object using a .mlmodel file.

from coremltools.models import MLModel
loaded_model = MLModel('linear_model.mlmodel')
print(loaded_model.short_description)
I approximate a sine curve with a linear model!

SAVE YOUR MODEL

An application developer can access your trained model with Object Storage using IBM Bluemix . You will need your Bluemix credentials to link to Object Storage, which can be generated from the data assets tab in your notebook: You need to have some files in your data assets for this screen to be visible!
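As a quick check of what the converter produced, you can also inspect the generated model specification and, on macOS (where the Core ML framework itself is available), run predictions directly from Python. This snippet is a sketch rather than part of the original notebook; the 'input' and 'prediction' names are the defaults shown above.

spec = coreml_model.get_spec()
print(spec.description)   # lists the model's input and output features

# On macOS only -- here Core ML, not scikit-learn, does the scoring:
# print(coreml_model.predict({'input': 1.0}))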
The cell below shows the credentials code generated from the data assets tab:

credentials_1 = {
    'auth_url':'https://identity.open.softlayer.com',
    'project':'object_storage_9-----3',
    'project_id':'7babac2********e0',
    'region':'dallas',
    'user_id':'9603b8************70f',
    'domain_id':'2c66d***********b9d26',
    'domain_name':'1026***',
    'username':'member_******************',
    'password':""""""***************"""""",
    'container':'TemplateNotebooks',
    'tenantId':'undefined',
    'filename':'2001.csv'
}

Don't worry about the filename in this credentials dictionary, as we will define a function put_file that will use the important security credentials generated above along with the local mlmodel file to send it to Object Storage.

from io import BytesIO
import requests
import json

def put_file(credentials, local_file_name):
    """"""This functions returns a StringIO object containing the file content from Bluemix Object Storage V3.""""""
    f = open(local_file_name,'r')
    my_data = f.read()
    url1 = ''.join(['https://identity.open.softlayer.com', '/v3/auth/tokens'])
    data = {'auth': {'identity': {'methods': ['password'],
            'password': {'user': {'name': credentials['username'],
                                  'domain': {'id': credentials['domain_id']},
                                  'password': credentials['password']}}}}}
    headers1 = {'Content-Type': 'application/json'}
    resp1 = requests.post(url=url1, data=json.dumps(data), headers=headers1)
    resp1_body = resp1.json()
    for e1 in resp1_body['token']['catalog']:
        if(e1['type']=='object-store'):
            for e2 in e1['endpoints']:
                if(e2['interface']=='public' and e2['region']=='dallas'):
                    url2 = ''.join([e2['url'],'/', credentials['container'], '/', local_file_name])
                    s_subject_token = resp1.headers['x-subject-token']
                    headers2 = {'X-Auth-Token': s_subject_token, 'accept': 'application/json'}
                    resp2 = requests.put(url=url2, headers=headers2, data = my_data)
                    print resp2

Calling put_file with your credentials and linear_model.mlmodel as the local filename will send your Core ML model into Object Storage. It is now available for the iOS/macOS application developer to access through Bluemix. You can find documentation on retrieving assets from Object Storage here . Now you can convert pre-trained machine learning models that you made in DSX and provide them to a software developer for use in iOS and macOS applications. Here is a link to the notebook in DSx where we ran this code. Please don't hesitate to contact me or Adam Massachi if you have any questions!

Originally published at datascience.ibm.com on June 9, 2017.","This post discusses how to implement Apple's new Core ML platform within DSX, which was announced a few days ago at WWDC 2017. 
Core ML is a platform that allows integration of powerful pre-trained…",Leverage Scikit-Learn Models with Core ML,Live,80 199,"Homepage Follow Sign in Get started Homepage * Home * About Insight * Data Science * Data Engineering * Health Data * AI * Javed Qadrud-Din Blocked Unblock Follow Following Nov 28 -------------------------------------------------------------------------------- TRANSFORM ANYTHING INTO A VECTOR ENTITY2VEC: USING COOPERATIVE LEARNING APPROACHES TO GENERATE ENTITY VECTORS Javed Qadrud-Din previously worked as a business architect at IBM Watson. At Insight, he developed a new method that allows businesses to efficiently represent users, customers, and other entities in order to better understand, predict, and serve them. Want to learn applied Artificial Intelligence from top professionals in Silicon Valley or New York? Learn more about the Artificial Intelligence program. -------------------------------------------------------------------------------- Businesses commonly need to understand, organize, and make predictions about their users and partners. For example, trying to predict which users will leave the platform (churn prediction), or identifying different types of advertising partners (clustering). The challenge comes from trying to represent these entities in a meaningful and compact way, to feed them into a machine learning classifier for example. I will be presenting the way I tackled this challenge below, all of the code is available on GitHub here . DRAWING INSPIRATION FROM NLP One of the most significant recent advances in Natural Language Processing (NLP) came from a team of researchers at Google ( Tomas Mikolov , Ilya Sutskever , Kai Chen , Greg Corrado , Jeffrey Dean ) created word2vec , which is a technique to represent words as continuous vectors called embeddings . The embeddings they trained on 100 billion words (and then open sourced) managed to capture much of the semantic meaning of the words they represent. For example, you can take the embedding for ‘king’, subtract the embedding for ‘man’, add the embedding for ‘woman’, and the result of those operations will be very close to the embedding for ‘queen’ — an almost spooky result that shows the extent to which the Google team managed to encode the meanings of human words. Mikolov et. al.Ever since, word2vec has been a staple of Natural Language Processing, providing an easy and efficient building block for many text based applications such as classification, clustering, and translation. The question I asked myself while at Insight was how techniques similar to word embeddings might be employed for other types of data, such as people or businesses. ABOUT EMBEDDINGS Let’s first think about what an embedding is. Physically, an embedding is just a list of numbers (a vector) that represent some entity. For word2vec, the entities were English words. Each word had its own list of numbers. These lists of numbers are optimized to be useful representations of the entities they stand for by adjusting them through gradient descent on a training task. If the training task requires remembering general information about the entities of interest, then the embeddings will end up absorbing that general information. 
-------------------------------------------------------------------------------- EMBEDDINGS FOR WORDS In the word2vec case, the training task involved taking a word (call it Word A) and predicting the probability that another word (Word B) appeared in a 10-word window around Word A somewhere in a massive corpus of text (100 billion words from Google News). Each word would have this done tens of thousands of times during training, with words that commonly appear around it, and words that never appear in the same context (a technique called negative sampling). This task forces the embedding for each word to encode information about the other words that co-occur with the embedded word. Words that co-occurred with similar sets of words would end up having similar embeddings. For example, the word ‘smart’ and the word ‘intelligent’ are often used interchangeably, so the set of words typically found around them in a large corpus will be a very similar set. As a result, the embeddings for ‘smart’ and ‘intelligent’ will be very similar to each other. Embeddings created with this task are forced to encode so much general information about the word, that they can be used to stand for the word in unrelated tasks. The Google word2vec embeddings are used in a wide range of natural language processing applications, such as sentiment analysis and text classification. There are also alternative word embeddings designed by other teams using different training strategies. Among the most popular are GloVe and CoVe . -------------------------------------------------------------------------------- EMBEDDINGS FOR ANYTHING Word vectors are essential tools for a wide variety of NLP tasks. But pre-trained word vectors don’t exist for the types of entities businesses often care the most about. Where there are pre-trained word2vec embeddings for words like ‘red’ and ‘banana’, there are no pre-trained word2vec embeddings for users of a social network, local businesses, or any other entity that isn’t frequently mentioned in the Google News corpus from which the word2vec embeddings were derived. Businesses care about their customers, their employees, their suppliers, and other entities for which there are no pre-trained embeddings. Once trained, vectorized representations of entities can be used as inputs to a wide range of machine learning models. For example, they could be used in models predicting which ads users are likely to click on, which university applicants are likely to graduate with honors, or which politician is likely to win an election. Entity embeddings allow us to accomplish these types of tasks by leveraging the bodies of natural language text associated with these entities that businesses frequently have. For example, we can create entity embeddings from the posts a user has written, the personal statement a university applicant wrote, or the tweets and blog posts people write about a politician. Any business that has entities paired with text could make use of entity embeddings, and when you think about it, most businesses have this one way or another: Facebook has users and the text they post or are tagged in, LinkedIn has users and the text of their profiles, Yelp has users and the reviews they write, along with businesses and the reviews written about them, Airbnb has places to stay along with descriptions and reviews, universities have applicants and the admission essays they write, and the list goes on. In fact, Facebook recently published a paper detailing an entity embedding technique. 
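To make that general recipe concrete before turning to the specifics of entity2vec, here is a rough sketch of the kind of model such a setup implies: a snippet of text (already reduced to a fixed-length vector) is scored against a handful of candidate entity embeddings, and the training signal is simply which candidate the snippet is really about. This is an illustration only, not the author's implementation (that lives in the GitHub repository linked earlier); the library (Keras), the layer sizes and the variable names are all assumptions.

import numpy as np
from keras.layers import Input, Dense, Embedding, Dot, Activation
from keras.models import Model

NUM_ENTITIES = 1000     # hypothetical number of entities being embedded
EMBEDDING_DIM = 64      # hypothetical embedding size
SNIPPET_DIM = 300       # hypothetical size of the text-snippet vector
NUM_CANDIDATES = 4      # one correct entity plus a few negative samples

# The text snippet, already summarised as a fixed-length vector.
snippet = Input(shape=(SNIPPET_DIM,))
# The ids of the candidate entities offered alongside this snippet.
candidates = Input(shape=(NUM_CANDIDATES,), dtype='int32')

snippet_vec = Dense(EMBEDDING_DIM)(snippet)                       # project the text into embedding space
entity_vecs = Embedding(NUM_ENTITIES, EMBEDDING_DIM)(candidates)  # trainable entity embeddings

# Score each candidate by its dot product with the snippet, then softmax over the candidates.
scores = Dot(axes=(1, 2))([snippet_vec, entity_vecs])
probs = Activation('softmax')(scores)

model = Model(inputs=[snippet, candidates], outputs=probs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
# Targets are the position (0..NUM_CANDIDATES-1) of the correct entity in each candidate list,
# so both the correct and the negatively sampled embeddings get gradient on every step.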
The aim with my entity2vec project was to find a way to use text associated with entities to create general-use embeddings that represent those entities. To do this, I used a technique somewhat similar to word2vec’s negative sampling to squeeze the information from a large body of text known to be associated with a certain entity into entity embeddings. EXAMPLE 1: FAMOUS PEOPLE To develop and test the technique, I tried training embeddings to represent prominent people (e.g. Barack Obama, Lady Gaga, Angelina Jolie, Bill Gates). Prominent people were a good starting point because, for these very famous peoples’ names, pre-trained Google word2vec embeddings exist and are freely available, so I’d be able to compare my embeddings’ performance against the word2vecs for those peoples’ names. Like with word2vec, I needed a training task that would force the entity embeddings to learn general information about the entities they stand for. I decided to train a classifier that would take a snippet of text from a person’s Wikipedia article and learn to guess who that snippet is about. The training task would take several entity embeddings as input and would output the position of the entity embedding that the text snippet is about. In the following example, the classifier would see as input a text snippet about Obama, as well as the embeddings for Obama, and three other randomly chosen people. The classifier would output a number representing which of its inputs is the Obama embedding. All of the embeddings would be trainable in each step, so, not only would the correct person embedding learn information about what that person is , but the other incorrect embeddings would also learn something about what their people are not . This technique seemed sensible intuitively, but, in order to validate my results, I needed to try the resulting embeddings out on some other tasks to see if they’d actually learned general information about their entities. To do this, I trained simple classifiers on several other tasks that took entity embeddings as inputs and outputted classifications like the gender or occupation of the entity. Here is the architecture of these classifiers: And here are the results obtained, compared against guessing and against doing the same thing with word2vec embeddings. My embeddings performed pretty much on-par with the word2vec embeddings even though mine were trained on much less text — about 30 million words vs 100 billion. That is four orders of magnitude less text required! -------------------------------------------------------------------------------- EXAMPLE 2: YELP BUSINESSES Next, I wanted to see if this technique was generalizable. Did it just work on people from Wikipedia, or does the technique work more generally? I tested it by trying exactly the same technique to train embeddings that represent businesses using the Yelp dataset. Yelp makes a slice of its dataset available online that contains businesses along with all the tips and reviews written about those businesses. I trained embeddings using precisely the same technique as I used with the Wikipedia people, except this time the text consisted of Yelp reviews about businesses and the entities were the businesses themselves. The task looked like this: Once trained, I tested the embeddings on a new task — figuring out which type of business a certain business was, e.g. CVS Pharmacy is in the ‘health’ category whereas McDonalds is in the ‘restaurants’ category. 
There were ten possible categories a business could fall into, and a single business could fall into multiple categories — so it was a challenging multi-label classification task with ten labels. The results, as compared with educated guessing, were as follows: This is a great result considering the difficulty of such a task! -------------------------------------------------------------------------------- Altogether, it was a successful experiment. I trained embeddings to capture the information in natural language text, and then I was able to get useful information back out of them by validating them on other tasks. Any business that has entities paired with text could use this technique, to be able to run predictive tasks on their data. NEXT STEPS AND CODE While these results are promising, the idea can be taken further by incorporating structured data into the embeddings along with text, which I will be looking to explore in the future. Anyone can now use this technique on their own data using a Python package I created and just a few lines of code. You can find the package on GitHub here . -------------------------------------------------------------------------------- Want to learn applied Artificial Intelligence from top professionals in Silicon Valley or New York? Learn more about the Artificial Intelligence program. Are you a company working in AI and would like to get involved in the Insight AI Fellows Program? Feel free to get in touch . Thanks to Emmanuel Ameisen .","Using cooperative learning approaches to generate entity vectors.",Transform anything into a vector,Live,81 202,"IBM WATSON MACHINE LEARNING: BUILD A LOGISTIC REGRESSION MODEL developerWorks TV
Published on Oct 3, 2017 This video shows how to create, train, save, and deploy a logistic regression model that assesses the likelihood that a customer of an outdoor equipment company will buy a tent based on age, sex, marital status and job profession.
","This video shows how to create, train, save, and deploy a logistic regression model using IBM Watson Machine Learning and IBM Data Science Experience that assesses the likelihood that a customer of an outdoor equipment company will buy a tent based on age, sex, marital status and job profession.",Build a logistic regression model with WML & DSX,Live,82 203,"COMPOSE'S FIRST GRAPH DATABASE: JANUSGRAPH Published Jun 15, 2017 graph janusgraph compose

At Compose we've always looked to ensure you can get the databases you need. Today, we are proud to announce that JanusGraph is coming to Compose and will bring with it the power of fully open source graph databases. JanusGraph is a new player in databases with a deep heritage. It builds on a fork of the Titan graph database, a previous leader in open source graph databases. That code is capable of being plugged into a number of different database backends. It's all then integrated with the database-agnostic Apache TinkerPop graph framework. The JanusGraph project itself is organized under the Linux Foundation and led by developers from Expero, Google, GRAKN.AI and IBM. And it's all open source, with new companies joining the community to enhance JanusGraph. At Compose, we've worked with IBM's JanusGraph developers to combine Compose's one-click deployment, high-availability, managed database platform with JanusGraph. A great graph database demands a great backend, and we've teamed it with Scylla, the high-performance Cassandra-compatible database, for best reliability. Then we added our automated backup system, private VLAN configuration and HAProxy-managed access to give peace of mind. That means that from today, Compose users can deploy the industry-leading graph database from their Compose account.

WHY A GRAPH DATABASE?

Graph databases model the world as nodes and directed connections - vertices and edges, as graph theory calls them. Both can have properties associated with them, and both are equal elements in how the database is managed and queried. A query on a graph database can start at a point and explore the connections around it, so you can say ""I'm looking for any person who likes brand X who has friends or friends of friends who buy brand Y and Z"". Relational databases typically treat relations as a simple connection between one row and another, or demand that you add another table to associate data with the relationship. That means that when you want to query across relationships and examine the network that exists, you have to do a lot of expensive queries. A graph database as part of your data layer allows you to understand and explore relationships and networks within your data without compromising the performance of your production relational and document stores.

JANUSGRAPH ON COMPOSE

We're launching JanusGraph on Compose as a beta as we build the functionality around it. You'll find it ready to deploy in the beta section of the Create Deployment view of Compose. If you haven't discovered Compose yet, you can sign up for a free 30-day trial below. If you want to learn more about JanusGraph on Compose, check out our JanusGraph documentation .
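To give a flavor of what the kind of traversal described above looks like in practice, here is a small sketch using the Apache TinkerPop gremlinpython client. It is illustrative only: the connection URL, graph alias, labels and property names are all hypothetical, and the JanusGraph documentation mentioned above is the place to look for the actual connection details for a Compose deployment.

from gremlin_python.structure.graph import Graph
from gremlin_python.process.graph_traversal import __
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# Hypothetical endpoint and alias -- take the real values from your deployment.
g = Graph().traversal().withRemote(
    DriverRemoteConnection('wss://example.composedb.com:16916/gremlin', 'g'))

# People who like brand X and have a friend, or a friend of a friend, who buys brand Y.
people = (g.V().has('brand', 'name', 'X').in_('likes')
           .where(__.repeat(__.out('friend')).times(2).emit()
                    .out('buys').has('name', 'Y'))
           .values('name').toList())
print(people)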
Try Compose free for 30 days

If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com . We're happy to hear from you. attribution Nayuki

Josh Mintz is an Offering Manager at IBM Watson Data Platform. He has an enthusiasm for homemade hummus, foreign policy, and the English Premier League.","Today, we are proud to announce that JanusGraph is coming to Compose and will bring with it the power of fully open source graph databases.",Compose's first graph database: JanusGraph,Live,83 204,"
Informatica Cloud * Load geospatial data into dashDB to analyze in Esri ArcGIS * Bring Your Oracle and Netezza Apps to dashDB with Database Conversion Workbench (DCW) * Install IBM Database Conversion Workbench * Convert data from Oracle to dashDB * Convert IBM Puredata for Analytics to dashDB * From Neteeza to dashDB: It’s That Easy! * Use Aginity Workbench for IBM dashDB * Build * Create Tables in dashDB * Connect apps to dashDB * Analyze * Use dashDB with Watson Analytics * Use dashDB with Spark * Use dashDB with Pyspark and Pandas * Use dashDB with R * Publish apps that use R analysis with Shiny and dashDB * Perform market basket analysis using dashDB and R * Connect R Commander and dashDB * Use dashDB with IBM Embeddable Reporting Service * Use dashDB with Tableau * Leverage dashDB in Cognos Business Intelligence * Integrate dashDB with Excel * Extract and export dashDB data to a CSV file * Analyze With SPSS Statistics and dashDB * DataWorks * Get Started * Connect to Data in IBM DataWorks * Load Data for Analytics in IBM DataWorks * Blend Data from Multiple Sources in IBM DataWorks * Shape Raw Data in IBM DataWorks * DataWorks API LOAD TWITTER DATA INTO DASHDBJess Mantaro / July 17, 2015This video shows how easy it is to consume Twitter data with IBM dashDB forfurther analytics.You can also read a transcript of this videoTry the tutorialRELATED LINKS * Get the codePlease enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM PrivacyIBM",Watch how easy it is to consume Twitter data with IBM dashDB for further analytics.,Tutorial: How to load Twitter data in IBM dashDB ,Live,84 207,"RStudio Blog * Home * Subscribe to feed TESTTHAT 1.0.0 April 29, 2016 in Packages testthat 1.0.0 is now available on CRAN. Testthat makes it easy to turn your existing informal tests into formal automated tests that you can rerun quickly and easily. Learn more at http://r-pkgs.had.co.nz/tests.html . Install the latest version with: install.packages(""testthat"") This version of testthat saw a major behind the scenes overhaul. This is the reason for the 1.0.0 release, and it will make it easier to add new expectations and reporters in the future. As well as the internal changes, there are improvements in four main areas: * New expectations. * Support for the pipe. * More consistent tests for side-effects. * Support for testing C++ code. These are described in detail below. For a complete set of changes, please see the release notes . IMPROVED EXPECTATIONS There are five new expectations: * expect_type() checks the base type of an object (with typeof() ), expect_s3_class() tests that an object is S3 with given class, and expect_s4_class() tests that an object is S4 with given class. I recommend using these more specific expectations instead of the generic expect_is() , because they more clearly convey intent. * expect_length() checks that an object has expected length. * expect_output_file() compares output of a function with a text file, optionally update the file. This is useful for regression tests for print() methods. A number of older expectations have been deprecated: * expect_more_than() and expect_less_than() have been deprecated. Please use expect_gt() and expect_lt() instead. * takes_less_than() has been deprecated. * not() has been deprecated. Please use the explicit individual forms expect_error(..., NA) , expect_warning(.., NA) , etc. 
We also did a thorough review of the documentation, ensuring that related expectations are documented together. PIPING Most expectations now invisibly return the input object . This makes it possible to chain together expectations with magrittr: factor(""a"") %>% expect_type(""integer"") %>% expect_s3_class(""factor"") %>% expect_length(1) To make this style even easier, testthat now imports and re-exports the pipe so you don’t need to explicitly attach magrittr. SIDE-EFFECTS Expectations that test for side-effects (i.e. expect_message() , expect_warning() , expect_error() , and expect_output() ) are now more consistent: * expect_message(f(), NA) will fail if a message is produced (i.e. it’s not missing), and similarly for expect_output() , expect_warning() , and expect_error() .quiet <- function() {} noisy <- function() message(""Hi!"") expect_message(quiet(), NA) expect_message(noisy(), NA) #> Error: noisy() showed 1 message. #> * Hi! * expect_message(f(), NULL) will fail if a message isn’t produced, and similarly for expect_output() , expect_warning() , and expect_error() .expect_message(quiet(), NULL) #> Error: quiet() showed 0 messages expect_message(noisy(), NULL) There were three other changes made in the interest of consistency: * Previously testing for one side-effect (e.g. messages) tended to muffle other side effects (e.g. warnings). This is no longer the case. * Warnings that are not captured explicitly by expect_warning() are tracked and reported. These do not currently cause a test suite to fail, but may do in the future. * If you want to test a print method, expect_output() now requires you to explicitly print the object: expect_output(""a"", ""a"") will fail, expect_output(print(""a""), ""a"") will succeed. This makes it more consistent with the other side-effect functions. C++ Thanks to the work of Kevin Ushey , testthat now includes a simple interface to unit test C++ code using the Catch library. Using Catch in your packages is easy – just call testthat::use_catch() and the necessary infrastructure, alongside a few sample test files, will be generated for your package. By convention, you can place your unit tests in src/test-.cpp . Here’s a simple example of a test file you might write when using testthat + Catch: #include

This is a test page for some interesting content

Click here!

Next, let’s put together some JavaScript that will listen to mouse events. For now, we'll log out the event directly and see what it gives us: document.addEventListener('click', function(event) { console.log(event); }); Simple enough - a single event listener that listens for clicks throughout our entire document. Now, when we click on the screen, we should see a MouseEvent in the logs that looks something like this: MouseEvent altKey:false bubbles:true button:0 ... returnValue:true screenX:189 screenY:590 ... pageX: 189 pageY: 830 ... toElement:div type:""click"" view:Window which:1 x:180 y:97 We’ll ignore almost all of these fields, but there are a few that are interesting to us with our click tracker. There are a few x and y coordinates, but the ones we’re the most interested in are the pageX and pageY fields, which represent the location on the website that the click occurred, irrespective of scrolling and viewport size. This means that a click on the site at the pageX location will always occur at that exact pixel location no matter how the browser is sized or how far down the user is scrolled when they click. We’ll also want to track the timestamp and toElement methods so we can search for which element was under the mouse when it was clicked. We'll want to associate all of the click events that occurred on each load of the site. We can do this by generating a random ID each time the user loads the page. The snippet of code can give us a workable random session ID: var generateRandomSessionId = function() { return 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, function(c) { var r = Math.random() * 16 | 0, v = c == 'x' ? r : r } We’ll also want to create a timestamp using the JavaScript Date object: var timestamp = new Date.now(); The resulting object that we’ll store to represent each click looks like the following: { ""x"": 100, ""y"": 100, ""timestamp"": 150000020302, ""sessionId"": 'a53cdbe2-acd1-4231-a331-fc3280d42ef1' } Let’s put these all together into a snippet we can install on our site. (function() { var generateRandomSessionId = function() { return 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, function(c) { var r = Math.random() * 16 | 0, v = c == 'x' ? r : r SETTING UP ELASTICSEARCH Now that we know what we want to store, let’s start pushing that data over to Elasticsearch. Elasticsearch uses a RESTFul API, but we don’t want to push our clicks directly from the browser since our URL includes our Elasticsearch credentials. To fix both of these, we’ll create a simple Node.JS application, following along from an earlier article on using Elasticsearch from Node.JS and use Express to handle our own API. Let’s start by creating a new Compose Elasticsearch deployment that we can push our click data out to. Then, create your Elasticsearch user and find your connection string on the Deployment Overview page. Once you have a valid connection string, we can start sending our RESTful API calls over to Elasticsearch. You'll need npm and the following Node modules to get this working: * elasticsearch * get-json * express * body-parser Install the modules using npm: npm install express body-parser elasticsearch get-json We’ll use the technique presented in the Getting Started guide to create a client.js and info.js , which we can use to make connections to our Elasticsearch deployment and get info about the deployment. 
They should look like the following: // client.js var elasticsearch=require('elasticsearch'); var client = new elasticsearch.Client( { hosts: [ 'https://[username]:[password]@[server]:[port]/', 'https://[username]:[password]@[server]:[port]/' ] }); module.exports = client; // info.js var client = require('./client.js'); client.cluster.health({},function(err,resp,status) { console.log(""-- Client Health --"",resp); }); Let’s use our new info.js file to do a quick check before we move forward. Type the following into the terminal: node info.js You should see a response that looks like this: -- Client Health -- { cluster_name: 'el-petitions', status: 'green', timed_out: false, number_of_nodes: 3, number_of_data_nodes: 3, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0 } If you get an error message (usually in HTML format) then double-check your connection credentials and make sure you’ve added a user / password for your deployment. CREATING AN INDEX An Index in Elasticsearch is different than you might be expecting - it’s more analogous to a Table in relational databases or a Collection in MongoDB. We can create the index a number of different ways, but here we’ll follow the Getting Started guide and do this in NodeJS. Create a new file called “init.js” and add the following: // init.js var client = require('./client'); client.indices.create({ index: 'clicks' },function(err,resp,status) { if(err) { console.log(err); } else { console.log(""create"",resp); } }); Run your new init.js file: node init.js And you should get the following response: create { acknowledged: true } Finally, let’s create the expressjs app that our frontend will call to save clicks to our database. We’ll use the Elasticsearch index call to add clicks to our index. // app.js var express = require('express'), app = express(), bodyParser = require('body-parser'), client = require('./client'), path = require('path'); app.use(bodyParser.json()); app.post('/registerClick', function(req, res) { client.index({ index: 'clicks', id: '1', type: 'click', body: req.body },function(err,resp,status) { res.send(resp); }); }); app.get('/', function(req, res) { res.sendFile(path.join(__dirname, 'index.html')); }); app.listen(process.env.PORT || 8080); Our app creates two routes, a /registerClick route where we’ll send our clicks to, and a / route which renders our HTML. You can access the site by running the following: node app.js And then opening http://localhost:8080 in your web browser. Right now anyone can send click events to our app, so when we’re ready to take this live we’ll probably want to add some security measures to make sure that only requests from the same server are allowed (ie: so someone can’t send bad click data into our app), but we won’t cover that for now. CONNECTING THE CLICK TRACKER TO NODE Now that you have your backend set up, let’s send our clicks back to our server so it can relay them on to Elasticsearch. For this article, we’ll include the JQuery library and it’s .ajax method to make our RESTful a little more readable. Add JQuery to the of your HTML file: ... ... Then, let’s update our snippet so that an ajax call is made every time a click is detected: (function($) { ... 
var clickApp = { trackClick: function(evt) { var click = { ""x"": evt.pageX, ""y"": evt.pageY, ""sessionId"": generateRandomSessionId(), ""timestamp"": Date.now() } $.post(""/registerClick"", click).then(function(response) { console.log(response); }); } } document.addEventListener('click', function(event) { clickApp.trackClick(event); }); ... })(jQuery); This snippet generates an AJAX POST request and sends the click data directly over to our NodeJS web application. We’re also logging out the response we get back, so we should be able to determine whether our click tracker is working. Finally, run your application again using node app.js , navigate to http://localhost:8080 in your browser and start clicking around. In the developer console of your browser, you should see something like the following: created:true _id:""AV3NW08fEajW3QBwsZU2"" _index:""clicks"" _shards:Object _type:""click"" _version:1 __proto__:Object The created: true is what you’re looking for - this means that your click was created successfully. You can head back to the Elasticsearch browser and click on your index to confirm: WRAPPING UP Now that you have your website clicks being tracked, you can add the Kibana plugin and start looking at which regions of your website are being clicked on the most often. In our next article, we’ll look at how to use this click data to generate a heat map of clicks and overlay them onto an image of our website. John O'Connor is a code junky, educator, and amateur dad that loves letting the smoke out of gadgets, turning caffeine into code, and writing about it all. Love this article? Head over to John O'Connor ’s author page to keep reading.CONQUER THE DATA LAYER Spend your time developing apps, not managing databases. Try Compose for Free for 30 DaysRELATED ARTICLES Feb 3, 2017NEWSBITS: SCYLLADB 1.6, GITLAB DB TROUBLES, ELASTICSEARCH 5.2, NODE 7.5.0, AND MORE NewsBits for week ending February 3rd: The release of ScyllaDB 1.6 RC1, Gitlab shuts down temporarily due to data troubles, R… John O'Connor Feb 1, 2017BUILDING SECURE DISTRIBUTED JAVASCRIPT MICROSERVICES WITH RABBITMQ AND SENECAJS To take Microservices into production, you need to make sure they are communicating securely and reliably. We explore using R… John O'Connor Oct 28, 2016NEWSBITS: ELASTICSEARCH 5.0, NODE 7.0, LAMBDA GO, SWIFT AND NO BATTERY TRANSISTORS Compose NewsBits for the week ending October 28th - Elasticsearch 5.0.0 released, Node 7.0.0 released, an AWS Lambda framewor… Hays Hutton Products Databases Pricing Add-Ons Datacenters Enterprise Learn Why Compose Articles Write Stuff Customer Stories Webinars Company About Privacy Policy Terms of Service Support Support Contact Us Documentation System Status Security © 2017 Compose, an IBM Company","Website Engagement Tracking is a technique that allows businesses to see which parts of their website users are visiting, clicking on, and viewing. In this article, we'll take a look at tracking user engagement using Elasticsearch on Compose.",Website Engagement Tracking with Elasticsearch,Live,91 233,"* Free 7-Day Crash Course * Blog * Masterclass 9 MISTAKES TO AVOID WHEN STARTING YOUR CAREER IN DATA SCIENCE EliteDataScience 0 Comments June 23, 2017 Share Google Linkedin TweetIf you wish to begin a career in data science, you can save yourself days, weeks, or even months of frustration by avoiding these 9 costly beginner mistakes. If you’re not careful, these mistakes will eat away at your most valuable resources: your time, energy, and motivation. 
We’ve broken them into three categories: * Mistakes while learning data science * Mistakes when applying for a job * Mistakes during job interviews WHILE LEARNING DATA SCIENCE The first set of mistakes are ""undercover"" and they're hard to spot. They slowly but surely drain your time and energy without giving you warning, and they spawn from the misconceptions surrounding this field. 1. SPENDING TOO MUCH TIME ON THEORY. Many beginners fall into the trap of spending too much time on theory, whether it be math related (linear algebra, statistics, etc.) or machine learning related (algorithms, derivations, etc.). This approach is inefficient for 3 main reasons: * First, it's slow and daunting. If you've ever felt overwhelmed by all there is to learn, you've likely sunk into this trap. * Second, you won't retain the concepts as well. Data science is an applied field, and the best way to solidify skills is by practicing. * Finally, there's a greater risk that you'll become demotivated and give up if you don't see how what you're learning connects to the real world. This theory-heavy approach is traditionally taught in academia, but most practitioners can benefit from a more results-oriented mindset. To avoid this mistake: * Balance your studies with projects that provide you hands-on practice. * Learn to be comfortable with partial knowledge. You'll naturally fill in the gaps as you progress. * Learn how each piece fits into the big picture (covered in our free 7-day crash course) . 2. CODING TOO MANY ALGORITHMS FROM SCRATCH. This next mistake also causes students to miss the forest for the trees. At the start, you really don't need to code every algorithm from scratch. While it's nice to implement a few just for learning purposes, the reality is that algorithms are becoming commodities. Thanks to mature machine learning libraries and cloud-based solutions, most practitioners actually never code algorithms from scratch. Today, it's more important to understand how to the apply the right algorithms in the right settings (and in the right way). To avoid this mistake: * Pick up general-purpose machine learning libraries, such as Scikit-Learn (Python) or Caret (R) . * If you do code an algorithm from scratch, do so with the intention of learning instead of perfecting your implementation. * Understand the landscape of modern machine learning algorithms and their strengths and weaknesses. 3. JUMPING INTO THE DEEP END. Some people enter this field because they want to build the technology of the future: Self-Driving Cars, Advanced Robotics, Computer Vision, and so on. These are powered by techniques such as deep learning and natural language processing. However, it's important to master the fundamentals. Every olympic diver needed to learn how to swim first, and so should you. To avoid this mistake: * First, master the techniques and algorithms of ""classical"" machine learning, which serve as building blocks for advanced topics. * Know that classical machine learning still has incredible untapped potential. While the algorithms are already mature, we are still in the early stages of discovering fruitful ways to use them. * Learn a systematic approach to solving problems with any form of machine learning (covered in our free 7-day crash course) . Don't try this at home (until you have plenty of practice) WHEN APPLYING FOR A JOB This next set of mistakes can cause you to miss some great opportunities during the job search process. 
Even if you're well qualified, you can maximize your results by avoiding these hiccups. 4. HAVING TOO MUCH TECHNICAL JARGON IN A RESUME. The biggest mistake many applicants make when writing their resume is suffocating it with technical jargon. Instead, your resume should paint a picture and your bullet points should tell a story. Your resume should advocate the impact you could bring to an organization, especially if you're applying for entry-level positions. To avoid this mistake: * Do not simply list the programming languages or libraries you've used. Describe how you used them and explain the results. * Less is more. Think about the most important skills to emphasize and give them the space to shine by removing other distractions. * Make a resume master template so you can spin off different versions that are tailored to different roles. This keeps each version clean. 5. OVERESTIMATING THE VALUE OF ACADEMIC DEGREES. Sometimes, graduates can overestimate the value of their education. While a strong degree in a related field can definitely boost your chances, it's neither sufficient nor is it usually the most important factor. To be clear, we're not saying graduates are arrogant... In most cases, what's taught in an academic setting is simply too different from the machine learning applied in businesses. Working with deadlines, clients, and technical roadblocks necessitate practical tradeoffs that are not as urgent in academia. To avoid this mistake: * Supplement coursework with plenty of projects using real-world datasets . * Learn a systematic approach to solving problems with machine learning (covered in our free 7-day crash course ). * Take relevant internships, even if they are part-time. * Reach out to local data scientists on LinkedIn for coffee chats. 6. SEARCHING TOO NARROWLY. Data science is a relatively new field, and organizations are still evolving to accommodate the growing impact of data. You'd be limiting yourself if you only search for ""Data Scientist"" openings. Many positions are not labeled as ""data science,"" but they'll allow you to develop similar skills and function in a similar role. To avoid this mistake: * Search by required skills (Machine Learning, Data Visualization, SQL, etc.). * Search by job responsibilities (Predictive Modeling, A/B Testing, Data Analytics, etc.). * Search by technologies used in the role (Python, R, Scikit-Learn, Keras, etc.). * Expand your searches by job title (Data Analyst, Quantitative Analyst, Machine Learning Engineer, etc.). Source: Cyanide and Happiness DURING THE INTERVIEW The last set of mistakes are stumbling blocks during the interview. You've already done the hard work to get to this step, so now it's time to finish strong. 7. BEING UNPREPARED TO DISCUSS PROJECTS. Having projects in your portfolio serves as a major safety net for ""how would you"" type interview questions. Instead of speaking in hypotheticals, you'll be able to point to concrete examples of how you handled certain situations. In addition, many hiring managers will specifically look for your ability to be self-sufficient because data science roles naturally include elements of project management. That means you should understand the entire data science workflow and know how to piece everything together. To avoid this mistake: * Complete end-to-end projects that allow you to practice every major step (i.e. Data Cleaning, Model Training, etc.). * Organize your methodology. Data science should be deliberate, not haphazard. 
* Review and practice describing past projects from any internships, jobs, or classes you've taken. 8. UNDERESTIMATING THE VALUE OF DOMAIN KNOWLEDGE. Developing technical skills and machine learning knowledge are the basic prerequisites for landing a data science position. However, to truly stand out above the competition, you should learn more about the specific industry you'll be applying your skills to. Remember, data science never exists in a vacuum. To avoid this mistake: * If you're interviewing for a position at a bank, brush up on some basic finance concepts. * If you're interviewing for a strategy position at a Fortune 500, practice a few case interviews and learn about drivers of profitability. * If you're interviewing for a startup, learn about its market and try to discern how it will gain a competitive edge. * In short, taking a little bit of extra initiative here can pay big dividends! 9. NEGLECTING COMMUNICATION SKILLS. Currently, in most organizations, data science teams are still very small compared to developer teams or analyst teams. So while an entry-level software engineer will often be managed a senior engineer, data scientists tend to work in more cross-functional settings. Interviewers will look for your ability to communicate with colleagues of various technical and mathematical backgrounds. To avoid this mistake: * Practice explaining technical concepts to non-technical audiences. For example, try explaining your favorite algorithm to a friend. * Prepare bullet point responses to common interview questions and practice delivering your answers. * Practice analyzing various datasets, extracting key insights, and presenting your findings. CONCLUSION In this guide, you learned practical tips for avoiding the 9 costliest mistakes by data science beginners: 1. Spending too much time on theory. 2. Coding too many algorithms from scratch. 3. Jumping into advanced topics, e.g. deep learning, too quickly. 4. Having too much technical jargon in a resume. 5. Overestimating the value of academic degrees. 6. Searching too narrowly for jobs. 7. Being unprepared to discuss projects during interviews. 8. Underestimating the value of domain knowledge. 9. Neglecting communication skills. To jumpstart your journey ahead, we invite you to sign up for our free 7-day email crash course on applied machine learning . You'll get exclusive lessons that aren't covered on our blog. For more over-the-shoulder guidance, we also offer a comprehensive Machine Learning Masterclass that will teach you data science while allowing you to build an impressive portfolio along the way. Share Google Linkedin TweetLEAVE A RESPONSE CANCEL REPLY Name* Email* Website* Denotes Required Field RECOMMENDED READING * 9 Mistakes to Avoid When Starting Your Career in Data Science * WTF is the Bias-Variance Tradeoff? 
(Infographic) * Free Data Science Resources for Beginners * Dimensionality Reduction Algorithms: Strengths and Weaknesses * Modern Machine Learning Algorithms: Strengths and Weaknesses * The Ultimate Python Seaborn Tutorial: Gotta Catch ‘Em All * The 5 Levels of Machine Learning Iteration Copyright © 2017 · EliteDataScience.com · All Rights Reserved * Home * Terms of Service * Privacy Policy","If you wish to begin a career in data science, you can save yourself days, weeks, or even months of frustration by avoiding these 9 costly beginner mistakes.",9 Mistakes to Avoid When Starting Your Career in Data Science,Live,92 237,"Homepage IBM Watson Data Lab Follow Sign in / Sign up * Share * * * Never miss a story from IBM Watson Data Lab , when you sign up for Medium. Learn more Never miss a story from IBM Watson Data Lab Get updates Get updates Lorna Mitchell Blocked Unblock Follow Following Developer Advocate at IBM. Technology addict, open source fanatic and incurable blogger (see http://lornajane.net) 13 mins ago -------------------------------------------------------------------------------- DEPLOY YOUR PHP APPLICATION TO BLUEMIX Deploying to the cloud can save so much time and hassle with commissioning and setting up servers, it’s no surprise that we’re seeing many more organisations take this approach for some or all of their web properties. In this post we’ll take a standard PHP application and deploy it to Bluemix. You’ll learn how to: * prepare your project for the cloud * put the databases or other storage elements in place * safely convey your work to its new home The example application here is a simple web-based Guestbook which every self-respecting website had once upon a time (about 20 years ago!). All the code for the project is on GitHub to make it easy to look at the examples here in the context of a real project. The Bluemix platform offers a 30-day free trial so you have a chance to try deploying your own application before handing over your credit card details. PREPARE FOR CLOUD When we deploy a PHP application, Bluemix automatically realises it’s a PHP project and uses the PHP buildpack to run it. The “buildpacks” are a Cloud Foundry term, meaning a collection of tools needed to run a particular tech stack. The PHP buildpack provides PHP with a selection of extensions and Composer already included. You can also specify other, non Bluemix, buildpacks available for Cloud Foundry if you want, and they’re used (specify this in the manifest.yml which we'll cover later on. Keep reading!). In contrast to a server where we choose which version of PHP to install, the buildpacks typically support multiple versions of PHP so we need to indicate which one this project should use by specifying this in composer.json if it's not already there. For my project, I needed to add this line to the require block in composer.json : ""php"" : "">=7.0"", This lets the buildpack know what my PHP dependency is, along with my other project dependencies since Composer runs when we deploy the application. The other change we need to make, is to specify where the webroot should be. We add this information to a file called .bp-config/options.json , which the buildpack automatically reads. There's a number of options that you can configure in this file, but we'll stick with the webroot for now. I set my webroot to the directory public/ which is where index.php is: Now that the PHP is ready, we move on to putting in the dependencies that the rest of the project needs. 
GET READY TO BLUEMIX At this point, it’s time to install the tools needed to work with Bluemix: * Cloud Foundry ( cf )provides all the commands you need to build and manage your app. * Bluemix CLI is useful for platform-specific tasks like managing regions, spaces, Bluemix virtual machines, and your account. These tools have a lot of overlap; everything we do today we can achieve with the cf command alone, however the bluemix tool has some bluemix-specific additions which you may find useful as your own projects grow. Once the commands are installed, we can create the services that our application depends on. Mine needs the Cloudant NoSQL database and RabbitMQ; if yours uses PostgreSQL or MySQL instead, or indeed any of the other services, the process will look pretty similar. You’ll need your Bluemix account details handy at this point — sign up for the free trial if you don’t already have an account.Before we can deploy an app to Bluemix, we need to create a space for it to deploy into. There are a few steps to this since Bluemix has multiple regions, organisations, and users, as well as spaces — which makes it really easy to be very organised with large applications but for our first PHP deployment might feel a little bit heavy I admit! 1. Pick a region and set the API endpoint, e.g. for US South use bluemix api https://api.ng.bluemix.net 2. Now log in using the command bluemix login . You'll be prompted for your username and password. 3. Check the list of organisations in your account with cf orgs and switch to the one you want to use by doing cf target -o [org] , replacing [org] with the organisation name you want. 4. Create a new space to deploy to: cf create-space Dev and target it cf target -s Dev 5. Verify that this all made some sense by doing cf target and make sure that your user, organisation and space settings are as you expect. We made it! The tools are ready, so we’ll go ahead and set up the services that my application needs. COMMISSION THE SERVICES FOR YOUR APP My application needs two services: a Cloudant NoSQL Database (the Bluemix name for CouchDB the awesome document database) and RabbitMQ (no fancy names there). Before I deploy, I will use the cf tool to create these services so that my application can use them. First up, I’ll create the Cloudant service. The cf help create-service command tells me that I need to specify the service, the plan and then name my service. To find out the exact service and plan names, use the command cf marketplace (beware it can take quite a long time to return as it has a lot of information to find!). For me, the command is: cf create-service cloudantNoSQLDB Lite guestbook-db Now when I run cf services I can see the database listed there. You can also see it by going to your Bluemix dashboard in the correct region/organisation/space combination. For most services, you can access their administrative interfaces from here. To set up RabbitMQ, I enter the following command: cf create-service compose-for-rabbitmq Standard guestbook-messages Again, cf services shows my new addition and at this point I have the pieces my application depends on. Be careful with your create-service commands. Both the services and the plans are case sensitive which can trip you up very easily! 
If you see errors, double-check you have everything spelled correctly, including case.There are two next steps: create a manifest file to describe how to deploy the application, and change the PHP code to know how to access these services we just configured using the Bluemix environment variables. We’ll do the manifest file next but if you were thinking there is a missing link, you’re definitely keeping up! On we go … CREATE THE MANIFEST FILE AND DEPLOY You usually describe how to deploy an application with a manifest file called manifest.yml . If you looked at the GitHub project, you can see that this application has other applications (and a dev platform setup) in addition to this PHP application. I'm putting my manifest file in the directory that contains just this PHP application, which is a couple of levels down from the root of the project, so here we're working in src/web . My manifest file simply names my application, allocates it some RAM, and states which services it needs available to it. Here it is: This contains all the information that Bluemix needs to run my application. (The application name must be unique across the whole of the bluemix region, so you need to change the name in your version of this file.) Now we’re ready to deploy. The moment of truth: run cf push to deploy your application to the cloud ... Hopefully that all worked well, and you see your application working through its deployment steps, installing any dependencies. Eventually you see an information block including a urls field. Go to that URL and you should see your project. In fact, you'll probably see an error message from your project because we didn't explain to PHP how to connect to the services we made earlier on, so we'll deal with that next. HANDY COMMANDS This seems like a good time to share a couple of commands that I use a lot when setting up an application for the first time: * cf logs --recent guestbook-web shows the last few lines of the logs from your application, including its deployment. If there's a syntax error in your PHP application, you'll see it here. You can also use just cf logs guestbook-web to see the logs as they happen (replace with your own application name as appropriate). * cf env guestbook-web shows the environment variables available to your application, and since we're about to connect our PHP to the services we created, this is very handy indeed! In particular we'll be looking at the VCAP_SERVICES environment variable as it contains the information about the services that this application has access to. I made my own quick reference card for keeping track of the commands, so feel free to check that out as you go along too! CONNECT PHP TO SERVICES From using the cf env command in the previous section, you can hopefully already see the data you need to plug your PHP application in to its services by accessing the VCAP_SERVICES variable. It's JSON-encoded so in PHP, I use this line to get the configuration into an array I can use: Once you have this, feel free to use var_dump or something to look at the structure, but essentially there's an array element per service type, with an array element inside for each actual service of that type. We need to amend our existing application so that when we detect we're running on Bluemix, we use the Bluemix variables, and otherwise we fall back to whatever our usual configuration process is. 
For instance, my example app connects to CouchDB (on the local development platform) or Cloudant (on Bluemix) with a block of code like this (from config.php ): The RabbitMQ setup is a bit more complicated since it uses a certificate and the RabbitMQ library I’m using (php-amqplib) expects the parameters all separate whereas Bluemix sends a complete URL with them already assembled. So the code for connecting to RabbitMQ looks like this (also from config.php ): Looking at this RabbitMQ configuration code, you can see that we first grab the contents of the VCAP_SERVICES environment variable, but then we need to grab the URL for the RabbitMQ service. Once we have it, we break it apart using parse_url() to get the arguments we'll need later to construct the RabbitMQ connection. The path piece has a leading slash on it which we don't need, so the substr() function sorts that out for us. We also need to grab the certificate needed for the SSL connection. (If you are deploying to Bluemix and using RabbitMQ, take a look in the dependencies.php file in the GitHub project for where the AMQP objects are actually instantiated. There are a couple of extra options to set that might help you succeed.) RUN PHP IN THE CLOUD The joy of the cloud is that you can just create new applications (and clean up those you don’t need any more) without ordering hardware, installing servers, or keeping up with patching. I’ve found that PHP applications are pretty happy on these cloud platforms and having this sort of setup lets the apps cost as little as necessary but scale as much as demand dictates — capacity planning is a hard problem and it means not having spare servers “just in case”. I hope that by sharing my steps above, I’ve shown you what you need to get your own PHP application into the cloud, and perhaps convinced you to give it a try. Bluemix Deployment PHP Cloud Computing Tutorial Blocked Unblock Follow FollowingLORNA MITCHELL Developer Advocate at IBM. Technology addict, open source fanatic and incurable blogger (see http://lornajane.net ) FollowIBM WATSON DATA LAB The things you can make with data, on the IBM Watson Data Platform.",Read how to take a standard PHP application and deploy to Bluemix.,Deploy Your PHP Application to Bluemix,Live,93 239,"Skip navigation Upload Sign in SearchLoading... Close Yeah, keep it Undo CloseTHIS VIDEO IS UNAVAILABLE. WATCH QUEUE QUEUE Watch Queue Queue * Remove all * Disconnect 1. Loading... Watch Queue Queue __count__/__total__ Find out why CloseARMAND RUIZ GABERNET, IBM - BIGDATANYC #BIGDATANYC 2016 #THECUBE SiliconANGLE Subscribe Subscribed Unsubscribe 6,734 6KLoading... Loading... Working... Add toWANT TO WATCH THIS AGAIN LATER? Sign in to add this video to a playlist. Sign in Share More * ReportNEED TO REPORT THE VIDEO? Sign in to report inappropriate content. Sign in * Transcript * Statistics 166 views 2LIKE THIS VIDEO? Sign in to make your opinion count. Sign in 3 0DON'T LIKE THIS VIDEO? Sign in to make your opinion count. Sign in 1Loading... Loading... TRANSCRIPT The interactive transcript could not be loaded.Loading... Loading... Rating is available when the video has been rented. This feature is not available right now. Please try again later. 
Published on Sep 28, 2016Armand Ruiz Gabernet (@armand_ruiz), Lead Product Manager - IBM Data Science Experience, IBM, sits down with Dave Vellante & Jeff Frick on the #theCUBE at #BigDataNYC 2016, New York, NY * CATEGORY * Science & Technology * LICENSE * Creative Commons Attribution license (reuse allowed) Show more Show lessLoading... Autoplay When autoplay is enabled, a suggested video will automatically play next.UP NEXT * IBM DataFirst Launch Event Keynote - #DataFirst - #theCUBE - Duration: 1:28:16. SiliconANGLE 65 views * New 1:28:16 -------------------------------------------------------------------------------- * Nik Green, Delhaize America & Kevin McIntyre, IBM - BigDataNYC #BigDataNYC 2016 #theCUBE - Duration: 17:19. SiliconANGLE 45 views * New 17:19 * Ritika Gunnar , IBM - BigDataNYC #BigDataNYC 2016 #theCUBE - Duration: 14:35. SiliconANGLE 14 views * New 14:35 * Cory Minton, DellEMC & Simeon Yep, Splunk - Splunk .conf2016 - #splunkconf16 - #theCUBE - Duration: 14:32. SiliconANGLE 63 views * New 14:32 * Robert Herjavec & Atif Ghauri, Herjavec Group - Splunk .conf2016 - #splunkconf16 - #theCUBE - Duration: 24:16. SiliconANGLE 20 views * New 24:16 * Chris Kammerman, Shazam - Splunk .conf2016 - #splunkconf16 - #theCUBE - Duration: 15:46. SiliconANGLE 47 views * New 15:46 * Matt Kraft, Dunkin' Brands - Splunk .conf2016 - #splunkconf16 - #theCUBE - Duration: 18:07. SiliconANGLE 24 views * New 18:07 * Snehal Antani, Splunk - Splunk .conf2016 - #splunkconf16 - #theCUBE - Duration: 16:21. SiliconANGLE 23 views * New 16:21 * Shay Mowlem, Splunk - Splunk .conf2016 - #splunkconf2016 - #theCUBE - Duration: 16:48. SiliconANGLE 31 views * New 16:48 * Haiyan Song & Monzy Merza, Splunk - Splunk .conf2016 - #splunkconf16 - #theCUBE - Duration: 20:52. SiliconANGLE 7 views * New 20:52 * Chuck Yarbrough, Pentaho A Hitachi Group Company - Big Data NYC - #BigDataNYC - #theCUBE - Duration: 16:20. SiliconANGLE 4 views * New 16:20 * Day 2 Kickoff - Splunk .conf2016 - #splunkconf16 - #theCUBE - Duration: 12:17. SiliconANGLE 3 views * New 12:17 * Tendu Yogurtcu, Synsort - BigDataNYC #BigDataNYC 2016 #theCUBE - Duration: 19:18. SiliconANGLE 18 views * New 19:18 * Josh Rogers, Syncsort - Splunk .conf2016 - #splunkconf16 - #theCUBE - Duration: 12:58. SiliconANGLE 35 views * New 12:58 * Ram Varadarajan - Splunk .conf2016 - #splunkconf16 - #theCUBE - Duration: 15:15. SiliconANGLE 13 views * New 15:15 * Steve Hatch, Cox Automotive - Splunk .conf2016 - #splunkconf16 - #theCUBE - Duration: 10:41. SiliconANGLE 2 views * New 10:41 * Wei Wang & Matt Morgan, Hortonworks - BigDataNYC #BigDataNYC 2016 #theCUBE - Duration: 16:49. SiliconANGLE 8 views * New 16:49 * Michael Dell, Dell Technologies - #VMworld 2016 #theCUBE - Duration: 15:37. SiliconANGLE 3,263 views 15:37 * Tom Gerhard, Priceline - Splunk .conf2016 - #splunkconf2016 - #theCUBE - Duration: 12:25. SiliconANGLE 24 views * New 12:25 * Greg Sands & Jim Wilson - Oracle OpenWorld - #oow16 - #theCUBE - Duration: 26:17. SiliconANGLE 31 views * New 26:17 * Loading more suggestions... * Show more * Language: English * Content location: United States * Restricted Mode: Off History HelpLoading... Loading... Loading... * About * Press * Copyright * Creators * Advertise * Developers * +YouTube * Terms * Privacy * Policy & Safety * Send feedback * Try something new! * Loading... Working... 
Sign in to add this to Watch LaterADD TO Loading playlists...","Armand Ruiz Gabernet (@armand_ruiz), Lead Product Manager - IBM Data Science Experience, IBM, sits down with Dave Vellante & Jeff Frick on the #theCUBE at #B...","Armand Ruiz Gabernet, IBM - BigDataNYC #BigDataNYC 2016 #theCUBE",Live,94 242,"☰ * Login * Sign Up * Learning Paths * Courses * Our Courses * Partner Courses * Badges * Our Badges * BDU Badge Program * Events * Blog * Resources * Resources List * Downloads * BLOG Welcome to the Big Data University Blog .SUBCRIBE VIA FEED RSS - Posts RSS - Comments SUBSCRIBE VIA EMAIL Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address RECENT POSTS * This Week in Data Science (December 20, 2016) * This Week in Data Science (December 13, 2016) * New York Data Science Bootcamp And Validated Badges * This Week in Data Science (December 06, 2016) * This Week in Data Science (November 29, 2016) CONNECT ON FACEBOOK Connect on FacebookFOLLOW US ON TWITTER My TweetsTHIS WEEK IN DATA SCIENCE (DECEMBER 20, 2016) Posted on December 20, 2016 by cora Here’s this week’s news in Data Science and Big Data. Don’t forget to subscribe if you find this useful! INTERESTING DATA SCIENCE ARTICLES AND NEWS * 5 Tips for Leveraging Big Data to Increase Holiday Sales – Running an e-commerce store? A small business? Big data could help you nab more sales this holiday season and get the revenue flowing in. * IBM and BMW want Watson to help drive your car – IBM Watson could soon be helping you to drive your car, as IBM’s cognitive computing unit is set to work with BMW Group to explore how the technology could aid cars of the future. * IBM’s Watson Turns Its Computer Brain to NASA Research – IBM’s Watson computer system, hosted in the cloud, is taking on NASA’s big research data. * Obama administration proposes that all new cars must be able to talk to each other – The Obama Administration on Tuesday proposed a rule that would require all new cars to be able to communicate with other cars wirelessly, a move that advocates said could save lives, but that also raises privacy and hacking concerns among opponents. * How Will the Softer Side of Robots Affect our Lives – Despite the advancements in Robotics and Artificial Intelligence, Robots have not learnt how to show emotion… just yet…but when we think of robots, more often than not images of clunky humanoid contraptions, metal with hinged joints and bulky movement spring to mind (excuse the pun). * How Artificial Intelligence Will Usher in the Next Stage of E-Government – Since the earliest days of the Internet, most government agencies have eagerly explored how to use technology to better deliver services to citizens, businesses and other public-sector organizations. * Data Science, Predictive Analytics Main Developments in 2016 and Key Trends for 2017 – Key themes included the polling failures in 2016 US Elections, Deep Learning, IoT, greater focus on value and ROI, and increasing adoption of predictive analytics by the “masses” of industry. * Data science skills: Is NoSQL better than SQL? – Big data is one of the hottest sectors in tech right now, but how do you stay on top of the changing technologies? David Pardoe of Hays Recruitment talks about the differences between SQL and NoSQL in data. * Amazon makes its first Prime Air drone delivery to a customer – Amazon has completed its first customer delivery by drone. * Big Data Science: Expectation vs. 
Reality – The path to success and happiness of the data science team working with big data project is not always clear from the beginning. It depends on maturity of underlying platform, their cross skills and devops process around their day-to-day operations. * Data Tools Offer Hints at How Judges Might Rule – Services offer lawyers statistics on how likely a given case is to be dismissed. * A Supercomputer Knows What Flavors You Like Better Than You Do – You probably feel like you have a good idea of what food you like and what you don’t. Turns out, you might actually enjoy flavor combinations (strawberries and jalapeno!?!) that you would never have thought to try. Enter, supercomputers. * Deep-Learning Machine Listens to Bach, Then Writes Its Own Music in the Same Style – Can you tell the difference between music composed by Bach and by a neural network? * The Countries With The Fastest Internet – According to Akamai, South Korea is well ahead of the pack when it comes to fast internet. * Tourists Vs Locals: 20 Cities Based On Where People Take Photos – Tourists and locals experience cities in strikingly different ways. * Deep Learning Reinvents the Hearing Aid – Finally, wearers of hearing aids can pick out a voice in a crowded room. SHARE THIS: * Facebook * Twitter * LinkedIn * Google * Pocket * Reddit * Email * Print * RELATED Tags: analytics , Big Data , data science , events , weekly roundup -------------------------------------------------------------------------------- COMMENTS LEAVE A REPLY CANCEL REPLY * About * Contact * Blog * Community * FAQ * Ambassador Program * Legal Follow us * * * * * * * Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.","Our forty fifth release of a weekly round up of interesting Data Science and Big Data news, links, and upcoming events.","This Week in Data Science (December 20, 2016)",Live,95 253,"Jump to navigation * Twitter * LinkedIn * Facebook * About * Contact * Content By Type * Blogs * Videos * All Videos * IBM Big Data In A Minute * Video Chats * Analytics Video Chats * Big Data Bytes * Big Data Developers Streaming Meetups * Cyber Beat Live * Podcasts * White Papers & Reports * Infographics & Animations * Presentations * Galleries * Subscribe ×BLOGS IMPROVING QUALITY OF LIFE WITH SPARK-EMPOWERED MACHINE LEARNING Post Comment June 2, 2016 by Michal Malohlava Software Engineer, H2O.ai by Desmond Chan Senior Director, Marketing, H2O.aiWe are in an age in which machine learning has increasing importance in our daily lives. Machine learning is put into action whenever your mobile map application automatically reminds you to leave for your next appointment because of unusual traffic situations. Besides personal assistants on your cell phones, wearable sport devices use machine-learning algorithms to propose personal training plans, and banks depend on accurate machine-learning models to detect malicious transactions. Healthcare, for instance, has also started to find helpful patterns in medical data using machine learning. Modern technologies allow for close monitoring of a patient’s condition through a large volume of data provided by a number of sensors. Machine learning is applied to this data to find patterns and predict how a patient will react to a treatment plan, for example. Accuracy is particularly important in this field because each miss can have significant implications. 
This case presents several challenges for machine learning technologists: * Complicated model preparation because of the huge data volume and various forms of data including highly imbalanced data sets * Constant model retraining and reevaluation because of the ever-changing nature of patient data and structure, as well as the need to improve the accuracy of model prediction * Fast deployment of newly trained models to monitor a patient’s condition MACHINE LEARNING WITH THE POWER OF SPARK Sparkling Water brings the H2O open source, machine-learning platform to Apache Spark environments. H2O runs directly in a Spark Java virtual machine (JVM), which eliminates any data transfer overhead that other solutions typically incur: H2O allows users to combine the data processing power of Spark with powerful machine-learning algorithms provided by the H2O platform. This combination solves the aforementioned challenges for machine learning technologists in a variety of ways. PARALLELIZED DATA PROCESSING H2O is designed to process huge amounts of data in a distributed and fully parallelized fashion. This approach means a hospital can fully leverage all the data available for their analyses, explore and test more models in quick iterations and benefit from the results. OPERATIONALIZED MODEL TRAINING, EVALUATION AND COMPARISON, AND SCORING Finding the optimum model for a given patient condition is a tedious process that has many moving parts. Hospitals need to try out different strategies to explore the space of possible models and various setups and compare the results best suited for their environments. H2O operationalizes this tedious training process in several ways: * Providing a library of machine-learning algorithms supporting advanced, algorithm-specific features; moreover, H2O allows combining models into ensembles—super learners * Performing fast exploration of hyperspace of parameters (aka grid search) * Offering the facility to specify various criteria that identify and select the best model—for example, accuracy, building time, scoring time and so on * Adding the ability to continue model preparation with modified parameters and additional relevant training data; this specific feature of H2O helps simplify the lives of data scientists and speeds up model preparation turnaround * Creating visualizations of various model characteristics on the fly and the final model during training; moreover, users can explore the performance of the model on training as well as validation—that is, unseen—data. H2O also allows users to stop the model training process manually, if the visual feedback reports unexpected results; modify parameters; and continue the training. OPTIMIZED MODEL DEPLOYMENT Model deployment is one of the most critical elements of the machine-learning process in healthcare—the model, or even multiple models, are instantiated and fed by real-time data from sensors monitoring a patient’s body, and the models need to provide predictions as quickly as possible. To meet these strict requirements, H2O allows for the export of trained models as an optimized code for deployment into target systems—that is, web services, applications and so on. The optimized code delivers the best possible response time, which is crucial for applications that need to react quickly to changing conditions. USE CASES WITH STREAMLINED IMPLEMENTATION Sparkling Water improves and streamlines the way machine learning is applied to healthcare. 
Besides healthcare, Sparkling Water can also elevate the use of machine learning in a variety of other use cases: * Detecting fraud in the finance industry, where high accuracy and speed are key factors * Proposing interest rates for insurance applications or predicting drivers’ risk factors * Planning truck maintenance based on tracking trucks’ telemetry. Next time when you think about improving the quality of your life, remember Sparkling Water. At the Apache Spark Maker Community Event, 6 June 2016, IBM is sharing important announcements for helping customers to use Spark, R and open data science to drive business innovations. Register for this in-person event . If you can’t attend, then register to watch a livestream presentation of the event . Follow @IBMBigData Topics: Analytics , Big Data Technology , Data Scientists Tags: machine learning , algorithm , machine-learning algorithm , Apache Spark , Spark , analytics , predictive analytics , big data , R , PythonRELATED CONTENT WHITE PAPERS & REPORTS INTRODUCING NOTEBOOKS: A POWER TOOL FOR DATA SCIENTISTS Check out the details on a tool that can change the game for data scientists—open source analytics notebooks. Learn what notebooks are, what value they provide and how to get started using them today. View White papers & Reports Blog The power of machine learning in Spark Blog How can data scientists collaborate to build better business applications? Blog InsightOut: The role of Apache Atlas in the open metadata ecosystem Blog Top analytics tools in 2016 Blog End-to-end analytics in the cloud Blog Highlights from the Apache Spark Maker Community Event Blog Experiencing deeper productivity in open data science White papers & Reports Using a predictive analytics model to foresee flight delays Blog Learning to fly: How to predict flight delays using Spark MLlib Blog Innovative business applications: The disruptive potential of open data science Blog Lean data science with Apache Spark Blog Boosting the productivity of the next-generation data scientist View the discussion thread. 
IBM * Site Map * Privacy * Terms of Use * 2014 IBM FOLLOW IBM BIG DATA & ANALYTICS * Facebook * YouTube * Twitter * @IBMbigdata * LinkedIn * Google+ * SlideShare * Twitter * @IBManalytics * Explore By Topic * Use Cases * Industries * Analytics * Technology * For Developers * Big Data & Analytics Heroes * Explore By Content Type * Blogs * Videos * Analytics Video Chats * Big Data Bytes * Big Data Developers Streaming Meetups * Cyber Beat Live * Podcasts * White Papers & Reports * Infographics & Animations * Presentations * Galleries * Events * Around the Web * About The Big Data & Analytics Hub * Contact Us * RSS Feeds * Additional Big Data Resources * AnalyticsZone * Big Data University * Channel Big Data * developerWorks Big Data Community * IBM big data for the enterprise * IBM Data Magazine * Smarter Questions Blog * Events * Upcoming Events * Webcasts * Twitter Chats * Meetups * Around the Web * For Developers * Big Data & Analytics Heroes More * Events * Upcoming Events * Webcasts * Twitter Chats * Meetups * Around the Web * For Developers * Big Data & Analytics Heroes SearchEXPLORE BY TOPIC: Use Cases All Acquire Grow & Retain Customers Create New Business Models Improve IT Economics Manage Risk Optimize Operations & Reduce Fraud Transform Financial Processes Industries All Automotive Banking Consumer Products Education Electronics Energy & Utilities Government Healthcare & Life Sciences Industrial Insurance Media & Entertainment Retail Telecommunications Travel & Transportation Wealth Management Analytics All Content Analytics Customer Analytics Entity Analytics Financial Performance Management Insight Services Social Media Analytics Technology All Business Intelligence Cloud Database Data Warehouse Database Management Systems Data Governance Data Science Hadoop & Spark Internet of Things Predictive Analytics Streaming Analytics Infographic Why manually analyzing video data is not an option Interactive Fighting the bad guys with advanced Cyber Threat Analysis Blog The 3 Cs of big data Blog The intersection of body camera video with CJIS guidelines and privacyMORE Infographic Why manually analyzing video data is not an option Interactive Fighting the bad guys with advanced Cyber Threat Analysis Blog The 3 Cs of big data Blog The intersection of body camera video with CJIS guidelines and privacy Blog IBM Analytics Day at the 2016 U.S. Open golf tournament Blog 4 ways intelligent video analytics enhance body-worn cameras White papers & Reports Capture more value from from body-worn camera video Blog The 3 Cs of big data Blog Keep your head above water with information lifecycle governance Infographic Adrift in a sea of data? Rise above the tide with IBM Information Lifecycle Governance solutions Blog Cloud-based ingestion: The future is hereMORE Blog The 3 Cs of big data Blog Keep your head above water with information lifecycle governance Infographic Adrift in a sea of data? 
",Discover an open source machine learning platform that combines the data processing power of Spark with powerful machine learning algorithms.,Improving quality of life with Spark-empowered machine learning,Live,96
256,"Welcome to the Big Data University Blog.
THIS WEEK IN DATA SCIENCE (NOVEMBER 08, 2016) Posted on November 9, 2016 by cora. Here's this week's news in Data Science and Big Data. Don't forget to subscribe if you find this useful!

INTERESTING DATA SCIENCE ARTICLES AND NEWS
* NASA Is Harnessing Graph Databases To Organize Lessons Learned From Past Projects – The space agency has a new tool to discover unexpected patterns in past projects.
* Tracking How the World is Feeling – Spur Projects, an Australian organization focusing on suicide prevention, has published the data from its “How Is the World Feeling?” mental health survey.
* Machines Can Now Recognize Something After Seeing It Once – Algorithms usually need thousands of examples to learn something. Researchers at Google DeepMind found a way around that.
* Uber Self-Driving Truck Packed With Budweiser Makes First Delivery in Colorado – The ride-hailing giant teamed up with AB InBev to transport beer in an autonomous vehicle, which they say is the world's first such commercial delivery.
* These Cows Will Text You When They're in Heat – Dairy farmers are using sensors in cows' stomachs to track the health of the herd.
* How To Boost An Organization's Competitive Advantage By Using Cognitive Computing – Between AI-powered chatbots and gadget-based voice assistants, cognitive computing capabilities have captured the public imagination. But consumer products are just the tip of the iceberg.
* Taking the hard work out of Apache Hadoop – Why has IBM created its own distribution of Apache Hadoop and Apache Spark, and what makes it stand out from the competition?
* MIT CSAIL brings reasoning to machine learning – More and more companies are taking advantage of artificial intelligence to train machines on their data and make predictions. Researchers from MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) want to take it a step further by revealing how a machine makes those insights.
* The White House Releases Paper on What It Wants to Do With Artificial Intelligence – The White House released a document earlier in the year via the Office of the President and the National Science and Technology Council Committee on Technology (NSTC).
* How Artificial Intelligence can enhance educational efforts – By taking advantage of artificial intelligence (AI) technologies, schools are giving teachers more tools to help their students while removing unnecessary obstacles.
* How Data Mining Reveals the World's Healthiest Cuisines – Algorithms are teasing apart the link between food and health to provide the first evidence that we really are what we eat.
* IBM Teams Up With Slack to Build Smarter Data-Crunching Chatbots – IBM is teaming up with Slack Technologies Inc. to make it easier for companies to build custom chatbots into the startup's workplace-messaging systems.
* The app developer's guide to creating your first Watson bot – You're building your first chat bot and the pressure's on. Never fear – the Watson team is here!
* Google's neural networks invent their own encryption – A team from Google Brain, Google's deep learning project, has shown that machines can learn how to protect their messages from prying eyes.

UPCOMING DATA SCIENCE EVENTS
* IBM Webinar: Self service analytics in a flash with dashDB – On November 17th, learn how the dashDB family of warehousing solutions can help with self-service analytics.
* Introduction to Python for Data Science – Learn how to use Python for data science on November 10th.
* IBM Event: Analytics Strategies in the Cloud – Join IBM and 2-time Canadian Olympic gold-medalist Alexandre Bilodeau on November 7th for a complimentary event in Montreal where you'll network, eat, drink and engage in an inspiring discussion on making business analytics easier and more available for all departments throughout your company.
","Our thirty ninth release of a weekly round up of interesting Data Science and Big Data news, links, and upcoming events.","This Week in Data Science (November 08, 2016)",Live,97
257,"HOW TO MAP GEOSPATIAL DATA: USA RIVERS February 7, 2017

R CODE Here's the R code to produce the map:

#===============
# LOAD PACKAGES
#===============
library(tidyverse)
library(maptools)

#===============
# GET RIVER DATA
#===============

#==========
# LOAD DATA
#==========

# DEFINE URL
# - this is the location of the file
url.river_data <- url(""http://sharpsightlabs.com/wp-content/datasets/usa_rivers.RData"")

# LOAD DATA
# - this will retrieve the data from the URL
load(url.river_data)

# INSPECT
summary(lines.rivers)
lines.rivers@data %>% glimpse()
levels(lines.rivers$FEATURE)
table(lines.rivers$FEATURE)

#==============================================
# REMOVE MISC FEATURES
# - there are some features in the data that we
#   want to remove
#==============================================
lines.rivers <- subset(lines.rivers, !(FEATURE %in% c(""Shoreline""
                                                      ,""Shoreline Intermittent""
                                                      ,""Null""
                                                      ,""Closure Line""
                                                      ,""Apparent Limit""
                                                      )))

# RE-INSPECT
table(lines.rivers$FEATURE)

#==============
# REMOVE STATES
#==============

#-------------------------------
# IDENTIFY STATES
# - we need to find out
#   which states are in the data
#-------------------------------
table(lines.rivers$STATE)

#---------------------------------------------------------
# REMOVE STATES
# - remove Alaska, Hawaii, Puerto Rico, and Virgin Islands
# - these are hard to plot in a confined window, so
#   we'll remove them for convenience
#---------------------------------------------------------
lines.rivers <- subset(lines.rivers, !(STATE %in% c('AK','HI','PR','VI')))

# RE-INSPECT
table(lines.rivers$STATE)

#============================================
# FORTIFY
# - fortify will convert the
#   'SpatialLinesDataFrame' to a proper
#   data frame that we can use with ggplot2
#============================================
df.usa_rivers <- fortify(lines.rivers)

#============
# GET USA MAP
#============
map.usa_country <- map_data(""usa"")
map.usa_states <- map_data(""state"")

#=======
# PLOT
#=======
ggplot() +
  geom_polygon(data = map.usa_country, aes(x = long, y = lat, group = group), fill = ""#484848"") +
  geom_path(data = df.usa_rivers, aes(x = long, y = lat, group = group), color = ""#8ca7c0"", size = .08) +
  coord_map(projection = ""albers"", lat0 = 30, lat1 = 40, xlim = c(-121,-73), ylim = c(25,51)) +
  labs(title = ""Rivers and waterways of the United States"") +
  annotate(""text"", label = ""sharpsightlabs.com"", family = ""Gill Sans"", color = ""#A1A1A1""
           , x = -89, y = 26.5, size = 5) +
  theme(panel.background = element_rect(fill = ""#292929"")
        ,plot.background = element_rect(fill = ""#292929"")
        ,panel.grid = element_blank()
        ,axis.title = element_blank()
        ,axis.text = element_blank()
        ,axis.ticks = element_blank()
        ,text = element_text(family = ""Gill Sans"", color = ""#A1A1A1"")
        ,plot.title = element_text(size = 34)
        )

USE THIS AS PRACTICE If you've learned the basics of data visualization in R (namely, ggplot2) and you're interested in geospatial visualization, use this as a small, narrowly-defined exercise to practice some intermediate skills. There are at least three things that you can learn and practice with this visualization:
1. Learn about color: Part of what makes this visualization compelling are the colors. Notice that in the area surrounding the US, we're not using pure black, but a dark grey. For the title, we're not using white, but a medium grey. Also, notice that for the rivers, we're not using “blue” but a very specific hexadecimal color. These are all deliberate choices. As an exercise, I highly recommend modifying the colors. Play around a bit and see how changing the colors changes the “feel” of the visualization.
2. Learn to build visualizations in layers: I've emphasized this several times recently, but layering is an important principle of data visualization. Notice that we're layering the river data over the USA country map. As an exercise, you could also layer in the state boundaries between the country map and the rivers. To do this, you can use map_data() (a short sketch of this appears at the end of this post).
3. Learn about 'Spatial' data: R has several classes for dealing with 'geospatial' data, such as 'SpatialLines', 'SpatialPoints', and others. Spatial data is a whole different animal, so you'll have to learn its structure. This example will give you a little experience dealing with it.

ITERATE TO GET THE DETAILS RIGHT What really makes this visualization work is the fine little details. In particular, the size of the lines and the colors. The reality is that creating good-looking visualizations requires attention to the little details. To get the details right for a plot like this, I recommend that you build the visualization iteratively. Start with a simple version of just the map of the US.

ggplot() +
  geom_polygon(data = map.usa_country, aes(x = long, y = lat, group = group), fill = ""#484848"")

Next, layer on the rivers:

ggplot() +
  geom_polygon(data = map.usa_country, aes(x = long, y = lat, group = group), fill = ""#484848"") +
  geom_path(data = df.usa_rivers, aes(x = long, y = lat, group = group))

Make no mistake: this doesn't look good. But, in the early stages, that's not the goal. You just want to make sure that the data are structurally right. You want something simple that you can build on. Ok, next, play with the river colors.
Start with a simple 'blue':

ggplot() +
  geom_polygon(data = map.usa_country, aes(x = long, y = lat, group = group), fill = ""#484848"") +
  geom_path(data = df.usa_rivers, aes(x = long, y = lat, group = group), color = ""blue"")

Let's be honest. This still does not look good. But it's closer. From here, you can play with the colors some more. Select a new color (I recommend using a color picker), and modify the color = aesthetic for geom_path().

ggplot() +
  geom_polygon(data = map.usa_country, aes(x = long, y = lat, group = group), fill = ""#484848"") +
  geom_path(data = df.usa_rivers, aes(x = long, y = lat, group = group), color = ""#99ccff"")

Not perfect, but better still. From here, you can continue to iterate, add more details, and get them all “perfect”:
* The exact color (this takes lots of trial-and-error, and a bit of good taste)
* The line size for geom_path()
* The title and text annotations
* Modify the projection, and change it to the “albers” projection with coord_map()
* The other theme() details like background color, removing extraneous elements (like the axis labels), etc

Once again: getting this just right takes lots of iteration. Try it yourself and build this visualization from the bottom up.

LEARN GGPLOT2 (BECAUSE GGPLOT2 MAKES THIS EASY) In this post, we've used ggplot2 to create this particular visualization. While I would classify this visualization at an “intermediate” level, ggplot2 still makes it relatively easy. That said, if you're interested in data science and data visualization, learn ggplot2. Longtime readers at Sharp Sight will know my thoughts on this, but if you're a new reader this is important. ggplot2 is almost without question the best data visualization tool available. Of course, different people will have different needs, but speaking generally, ggplot2 is flexible, powerful, and it allows you to create beautiful data visualizations with relative ease. Not interested in visualization per se? Do you want to focus on machine learning instead? Fair enough. If you want to learn machine learning, you still need to be able to analyze and explore your data. Once again, the best tool for exploring and analyzing your data is ggplot2. This is particularly true when you combine it with dplyr, tidyr, stringr, and other tools from the tidyverse.
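As a follow-up to practice point 2 above, here is a minimal sketch of one way to layer the state boundaries between the country map and the rivers. It assumes the map.usa_country, map.usa_states and df.usa_rivers objects created in the script above are already in your workspace; the boundary color and size values are just placeholders to experiment with.

ggplot() +
  # base layer: the country silhouette
  geom_polygon(data = map.usa_country, aes(x = long, y = lat, group = group), fill = ""#484848"") +
  # middle layer: state boundaries, drawn as thin outlines only
  geom_polygon(data = map.usa_states, aes(x = long, y = lat, group = group),
               fill = NA, color = ""#5c5c5c"", size = .1) +
  # top layer: the rivers
  geom_path(data = df.usa_rivers, aes(x = long, y = lat, group = group),
            color = ""#8ca7c0"", size = .08)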
First Name E-Mail Address © 2017 · Powered by data",Sign up now to learn about data visualization in R,How to map USA rivers using ggplot2,Live,98 266,"SEVEN DATABASES IN SEVEN DAYS – DAY 2: MONGODB Lorna Mitchell and Matt Collins / August 5, 2016This post is part of a series of posts created by the two newest members of our Developer Advocate team here at IBM Cloud Data Services. In honour of the book Seven Databases in Seven Weeks by Eric Redmond and Jim R. Wilson, we challenged Lorna and Matt to take a new database from our portfolio every day, get it set up and working, and write a blog post about their experiences. Each post reflects the story of their day with a new database. We’ll update our seven-days GitHub repo with example code as the series progresses. —The Editors * Database type: schemaless JSON-like storage with search and data aggregation * Best tool for: creating highly scalable apps that need to query large datasets fast MongoDB . It’s kind of a thing. OVERVIEW MongoDB is a NoSQL database that allows you to store your data in JSON-like documents rather than the more traditional RDBMS approach. With a focus on scalability (sharding and replication are available out of the box) and flexibility (data stores are schemaless and easily searchable via secondary indexes — even geospatial!). MongoDB intends to provide a database that maps to your application and keeps up through iterations. There are also a number of other features, such as a powerful Data Aggregation Pipeline and MapReduce, or for more in depth analysis you can connect MongoDB directly to Hadoop or Spark. MongoDB is open source, so you can get up and running pretty quickly although we are going to make use of MongoDB from Compose to get up and running for the purposes of this article. We will cover how to get started with MongoDB and put together a simple example showing how you can utilise this database to store blog posts with threaded comments. GETTING SET UP Start by setting up a MongoDB instance on Compose — this may take a few minutes to deploy. Make sure to check the SSL option when configuring your deployment! Once deployed, you will be able to add a database and a user. Compose presents you with a cool “command line” style interface to do this, and it’s simple enough. When creating your user, make sure you make a note of the password as Compose will hide this from you going forwards. Compose will automatically give your user the permissions it requires. Also, for this tutorial, make sure you create your user in the special admin database. For more on user permissions in Compose’s MongoDB service, see Connecting to the new MongoDB at Compose . Creating a database is just as simple as giving it a name. Feel free to try it out, but our example code will handle database creation for you. We are going to use PHP to create our examples, and you’ll need to install the MongoDB PHP extension . Since I’m on Linux, I’ll use pecl : pecl install mongodb According to the PHP documentation , Mac users should use brew : brew install php55-mongodb New to PHP on a Mac? Without installing MAMP , here’s how to get PHP running on your local Apache web server . And here’s how to access php.ini if you need to append extension=mongodb.so . And finally, you will need to save a copy of the SSL Certificate from the Overview page of the Compose control panel — we have saved ours into a file simply called cert . Make sure to include the -----BEGIN CERTIFICATE----- and -----END CERTIFICATE----- in your saved file. 
CONNECTING FROM PHP MongoDB is well supported, with libraries available for all of your favourite languages, including PHP. These libraries make it very easy to gain access to all of MongoDB's features and let you focus on building your app. To connect from PHP you will need the following:
* Your username
* Your password
* Your SSL Certificate

Combine the username and password with the connection string provided by Compose on the Overview screen to get something like this:

mongodb://your-user-here:your-password-here@aws-us-east-1-portal.11.dblayer.com:28086/admin?ssl=true

This connection string will be different for each deployment, so be careful when copying and pasting! Here are two general notes, however:
* Use the connection string for drivers that cannot handle failover between mongos nodes.
* Connect to the admin database, since our code keeps it simple.

In addition to the MongoDB extension, we'll use the PHP library that MongoDB provide to give a nice, easy wrapper for accessing MongoDB. This can be installed via Composer:

composer require ""mongodb/mongodb=^1.0.0""

This adds the requirement into your composer.json file (creating it if it didn't exist already). You'll need to run the composer install command to bring the files in; these can be found in the vendor directory. Now use the connection string from before to connect to your MongoDB deployment as so:

connect.php

<?php
// connect.php
// (reconstructed: parts of this script were lost when the original page was scraped)
require 'vendor/autoload.php';

// connect, passing the path to the saved SSL certificate as a driver option
$client = new MongoDB\Client(
    ""mongodb://sevendbs:a22733d78a33d34c20da2d84ee9db5e4@aws-us-east-1-portal.11.dblayer.com:28086/admin?ssl=true"",
    [],
    [""cafile"" => ""./cert""]
);

// select the posts collection from the posts database
$posts = $client->selectDatabase(""posts"")->selectCollection(""posts"");

Notice that as well as passing in the connection string, we are also providing the path to the SSL Certificate. We can then select the posts collection from the posts database that we create. This file is saved as connect.php and our other scripts also use it to connect.

What is a collection in MongoDB? From the MongoDB reference manual: “A [collection is a] grouping of MongoDB documents. A collection is the equivalent of an RDBMS table. A collection exists within a single database. Collections do not enforce a schema. Documents within a collection can have different fields. Typically, all documents in a collection have a similar or related purpose.”

INSERTING DATA The PHP library is a fairly lightweight wrapper around MongoDB's command-line interface, which really helps to make this database feel like a consistent interface across platforms (all the other language drivers also follow this pattern). In this case, we're using the insertOne method to add a new blog post to our posts collection. This example shows a very basic PHP script which will display an HTML form, allowing the user to enter some data, which then gets saved in the database. Here's the form itself, followed by the code. If you're not seeing “Post saved” upon form submission, remember to enable debugging!

add_post.php

<?php
// add_post.php
// (reconstructed: the original HTML form markup was stripped out when the page was
//  scraped, so the markup below is a simplified stand-in; the original used purecss.io)
require 'connect.php';

if (!empty($_POST)) {
    // build up the document to store
    $data = [
        ""title""       => filter_input(INPUT_POST, ""title"", FILTER_SANITIZE_STRING),
        ""description"" => filter_input(INPUT_POST, ""post"", FILTER_SANITIZE_STRING),
    ];
    $posts->insertOne($data);
    echo ""Post saved"";
} else {
    // show the form
    ?>
    <h1>MongoDB in action</h1>
    <h2>Add A Post</h2>
    <form method=""post"">
        <input type=""text"" name=""title"" placeholder=""Title"" />
        <textarea name=""post"" placeholder=""Write your post here""></textarea>
        <button type=""submit"">Save</button>
    </form>
    <?php
}
You can see that if there's no data supplied, a simple form is shown here (with a little http://purecss.io to make it nicer to look at) so we can quickly start adding data. If data does arrive as a POST request, then we build up an array with the data we want, and then save it to MongoDB. Remember that MongoDB does not have a schema: you can build up whatever data structure you like before inserting, and the shape of the data can be different each time, which makes it ideal for sparse properties, for example. MongoDB will give our record a unique ID when it saves it. You may also want to supply this yourself, which you can do by including a _id key and the desired value when creating the data to insert. Either way, this is useful when we come to fetch a list of records and want to be able to identify just one of them.

FETCHING DATA Mongo has some great query functionality, and its “aggregation framework” is excellent for gaining insights into potentially large and nested data sets. We just want a list of posts however, and for that we simply use the find() method, then output each of our posts along with a count of comments (more on comments in the next section):

index.php

<?php
// index.php
// (reconstructed: the original HTML markup was stripped out when the page was scraped)
require 'connect.php';

// fetch every post in the collection
$allPosts = $posts->find();
?>
<h1>MongoDB in action</h1>
<h2>Blog Posts</h2>
<ul>
<?php foreach ($allPosts as $p): ?>
    <li>
        <a href=""/post.php?id=<?php echo $p->_id; ?>""><?php echo $p->title; ?></a>
        (<?php echo isset($p->comments) ? count($p->comments) : 0; ?> comments)
    </li>
<?php endforeach; ?>
</ul>
MongoDB returns each document as an object, with properties set for each of the fields that were stored. This makes it very easy to access using object notation, e.g. the $p->title in the example above. In the list, we're also adding hyperlinks and using the ID so that we can fetch individual records on another page.

ADDING NESTED DATA MongoDB doesn't really do joins, so for the most part, database design involves storing data together that will be used together. So if you're storing content, you'll probably have a bunch of content elements and anything they rely on, all inside one document. In this example, we're storing blog posts and we'll add the comments as part of the post record. Here's the individual post page, which displays the post, allows a user to add a comment, and lists the comments that have already been added:

post.php

<?php
// post.php
// (reconstructed: the original HTML markup was stripped out when the page was scraped,
//  so the markup below is a simplified stand-in)
require 'connect.php';

if (!empty($_POST)) {
    // a new comment was submitted: push it onto the post's comments array
    $id = filter_input(INPUT_POST, ""id"");
    $data = [
        ""username"" => filter_input(INPUT_POST, ""name"", FILTER_SANITIZE_STRING),
        ""comment""  => filter_input(INPUT_POST, ""comment"", FILTER_SANITIZE_STRING),
    ];
    $result = $posts->updateOne(
        [""_id"" => new MongoDB\BSON\ObjectID($id)],
        ['$push' => [""comments"" => $data]]
    );
    header(""Location: /post.php?id="" . $id);
    exit;
} else {
    $id = filter_input(INPUT_GET, ""id"");
}

if ($id):
    $post = $posts->findOne([""_id"" => new MongoDB\BSON\ObjectID($id)]);
?>
<h1>MongoDB in action</h1>
<h2><?php echo $post->title; ?></h2>
<p><?php echo $post->description; ?></p>

<h3>Add Comments</h3>
<form method=""post"">
    <input type=""hidden"" name=""id"" value=""<?php echo $id; ?>"" />
    <input type=""text"" name=""name"" placeholder=""Your name"" />
    <textarea name=""comment""></textarea>
    <button type=""submit"">Save</button>
</form>

<?php if (isset($post->comments)): ?>
<ul>
    <?php foreach ($post->comments as $comment): ?>
        <li><?php echo $comment->comment; ?> - by <?php echo $comment->username; ?></li>
    <?php endforeach; ?>
</ul>
<?php endif; ?>
<?php endif; // if the post actually existed ?>
The interesting bit here is really where we save the comments, the call to $posts->updateOne. We use the same filter criteria as we do when we fetch the post, but then we go on to push the $data array onto the end of the comments collection. If this collection doesn't exist, MongoDB will simply create it. Look out for using the mongo identifiers such as $push — in PHP we need to carefully wrap them in single quotes so that PHP doesn't try to interpret the $! Now our comments are inside our existing MongoDB document:

{
    ""_id"" : ObjectId(""575038cc1661d711090e9911""),
    ""title"" : ""Databases are excellent"",
    ""description"" : ""We could talk about them for hours"",
    ""comments"" : [
        { ""username"" : ""lorna"", ""comment"" : ""I think so too"" },
        { ""username"" : ""lorna"", ""comment"" : ""I think so too"" },
        { ""username"" : ""fred"", ""comment"" : ""Thanks for this post, it helped me!"" },
        { ""username"" : ""george"", ""comment"" : ""I totally disagree, they are a hazard"" }
    ]
}

With this in place, we can add some comments to our database and then revisit the index page to see how things are looking.

CONCLUSION MongoDB is quite a key player in the NoSQL arena, and this shows through with the amount of developer support that is available on their website in the shape of libraries and docs; however, there were some instances where we were looking for examples that didn't seem to exist! On the plus side, MongoDB does have a solid user base and there is a rich ecosystem of content from forums and other people's blog posts that will help you — beware that the PHP libraries changed relatively recently, though, so you may find some content is outdated. One feature that can set it apart from some of its rivals is that you don't need to write the whole document back again when updating — you can simply push updates to the fields that you require. This can help avoid conflicts in a write-heavy application. The big selling point, however, is the schema-less and scalable nature of the database, meaning that you really can build apps with the future in mind without worrying about how your infrastructure will adapt. The inclusion of secondary indexes allows quick searching on huge amounts of data, and that can only be a positive. With MongoDB being open source you can get started on any platform and deploy to more or less anywhere, or if you want to avoid that entirely there are a number of cloud-based providers available.
","Looking to learn the basics of cloud databases? In this series, we show them running on Compose and intro programmatic access. Enter: MongoDB + PHP.",Seven Databases in Seven Days – Day 2: MongoDB,Live,99
268,"IBM DATA CATALOG: USE DATA ASSETS IN A PROJECT, developerWorks TV. Published on Oct 31, 2017. This video shows you how to add a data asset to an existing project and then load that data for analytics in a Python notebook. Find more videos in the IBM Data Catalog Learning Center at http://ibm.biz/data-catalog-learning
",This video shows you how to add a data asset to an existing project and then load that data for analytics in a Python notebook. ,Use data assets in a project using IBM Data Catalog,Live,100
269,"HOW TO CHOOSE A PROJECT TO PRACTICE DATA SCIENCE March 14, 2017

Here at Sharp Sight, I've derided the “jump in and build something” method of learning data science for quite some time. Learning data science by “jumping in” and starting a big project is highly inefficient. However, projects can be extremely useful for practicing data science and refining your skillset, if you know how to select the right project. Before I give you some pointers on how to select a good project, let's first talk about why “jump in and build” is not the best method of learning data science.

JUMP IN AND BUILD IS BAD FOR LEARNING As I mentioned above, Jump in and Build Something™ is the method of learning where you jump in and just build something. It's based on the idea that the best way to learn a new skill is to select a large project and just build, even if you don't know most of the requisite skills. You see this quite a bit in programming. A few years ago, you used to hear guys say “I'm going to learn PHP by building an online social network” (essentially, building a Facebook copy).

JUMP IN AND BUILD IS EXTREMELY INEFFICIENT While I will admit that it is possible to learn a new skill by jumping into a new project, you have to understand that it's extremely inefficient. I also tend to think that for beginners, the “knowledge gained” decreases dramatically as the size and complexity of a project increases. That's another way of saying that if a beginner selects a project that's too big, they're likely to learn very little (although, large projects can be very useful for advanced practitioners). The reason for this is that if you choose a project that's too big, and you don't know most of the skills, you get bogged down just trying to learn everything before you can move on to getting things done. If you “jump in” to a very complicated project, but you don't know the requisite skills, you're going to spend 99% of your time just looking things up.
If you’re a beginner and you don’t know much, you might even have trouble figuring out where to start. Essentially, if you try to work on a project that’s too large or too complicated, you’ll spend all of your time trying to learn dozens of small things that you should have learned before starting the project. To help clarify why this point, I’ll give you a few analogies. EXAMPLES: WHEN “JUMPING IN” IS A BAD IDEA I can think of dozens of examples in other arenas where “jumping in” can get you in over your head, but here are two that some of you might be familiar with: learning an instrument and lifting weights. TRYING TO LEARN GUITAR WITH SOMETHING WAY TOO HARD At some point in their life, most people have a desire to learn to play an instrument. For many people (and guys in particular) learning guitar is a goal. If you want to learn to play guitar, are you going to jump in and try to learn to do this right away? There are some people who are foolish enough to try. The fact is, learning to play guitar like this would take most people years. More importantly, it would take years of preparation by learning thousands of little skills before you’d be at a level to perform like this. You’d have to learn a thousand little things: how to position your fingers on the fretboard. How to pick. How to play little “phrases” and also how to play fast. Etcetera. Moreover, it’s not just the nuts-and-bolts techniques that makes it hard. It’s also a matter of style. To play guitar like this, you need to learn how to be expressive with the guitar. That’s a completely separate skill that also takes years. So if you want to learn to play guitar, could you do it by jumping in and learning the guitar solo in the video? Is it possible to learn guitar by trying to learn this complicated guitar solo, one note at a time? Would you be able to do this without knowing any foundational guitar skills beforehand? Maybe. But it would be a long, frustrating effort. My guess is that such a task would induce most people to quit. For beginning guitarists, it’s much, much more effective and efficient to start with the absolute foundational guitar skills, master the foundational skills, and progressively move on to skills of increasing difficulty. It’s much more effective to put together a systematic plan with a skilled teacher that puts you on the path to your goal in structured way. Data science is exactly like this. The most efficient and effective way to learn data science is to be highly systematic. You need to have a plan. You need to learn the right things in the right order. The optimal strategy for learning data science is almost the opposite of “jump in and build something.” TRYING TO GET STRONG BY LIFTING TOO MUCH WEIGHT Here’s another example. If you want to get fit and strong, it’s a terrible idea to jump in and try to lift very heavy weight. If you “jump in” and try to lift an amount of weight that’s far beyond your strength level, you’re likely to fail. Like this guy. Wow. Too much man. Take some weight off the bar. In weightlifting, if you try to lift too much, you’re likely to fail and you might even get hurt. In data science, you won’t have a risk of injuring yourself physically, but you might incur a different sort of damage: you might injure your ego . You might attempt a project that’s too hard and subsequently fail. Your failure might cause you to believe that you’re “not smart enough” to learn data science, and you might give up altogether. I hear it all the time. 
People try something that’s too hard, fail, and then give up. It’s a very real risk. There’s actually a much better way to become a strong data scientist and it’s a lot like trying to get strong in the gym. In the gym, the best way to get strong is to start with light weights, and learn the basic motions safely with those low weights. Then, add a little weight each week. Five pounds. Maybe ten. That doesn’t sound like a lot, but over the course of only a few months, if you continue to add weight to the bar each week, you will get stronger. Similarly, in data science, instead of jumping into a project with a high difficulty level, you should start with something small and do-able with your current skill level, then increase the size and complexity of your projects as you learn more over time. It’s remarkably similar to weightlifting. Start small, then increase complexity. Over time, you will become a strong, highly skilled data scientist. WHEN TO USE PROJECTS TO PRACTICE DATA SCIENCE At this point, I want to clarify something, to make sure that you don’t get the wrong idea. Projects are great, but not for learning . At a high level, projects are not very good for learning skills. However, projects are excellent for 2 things: 1. Integrating skills that you’ve already learned 2. Identifying skill gaps PROJECTS HELP YOU INTEGRATE SKILLS YOU’VE ALREADY LEARNED As you develop as a data scientist, projects are best for integrating the things that you already know. Here’s what I mean: Many of the skills that you need to learn in order to become a data scientist are highly modular . This is particularly true if you’re using the tidyverse in R. For the most part, the tidyverse was designed such that each function does one thing, and does it well. Each of these small tools (I.e., each function) is a small unit that you should learn and practice on a very small scale before starting a project. You should find very, very simple examples and practice those examples repeatedly over time. This is just like a guitar player: a guitar player might practice a guitar scale every single day for a few weeks (or years). He might have a set of 3 chords and practice simple transitions between those guitar chords. Similarly, you should have small, learnable units that you practice regularly. As a beginner, you should practice just making a bar chart. You should practice how to use dplyr::mutate() to add a new variable to a dataset . You should learn these skills on very simple examples, and practice them repeatedly until you can write that code “with your eyes closed.” Then, when you start working on a small project, the project will help you integrate those skills. For example, you’ll often need to use dplyr::filter() in combination with ggplot2 to subset your data and create a new plot. Working on a project gives you an opportunity to put these two tools together. It allows you to take ggplot() and filter() – which you should have practiced separately – and integrate them in a way that produces something new and more complex. This is what projects are great for: they help you put the pieces together. Projects help you integrate skills that you’ve already learned into a more cohesive whole. PROJECTS HELP YOU IDENTIFY SKILL GAPS The second use for projects is to help you identify skill gaps. When you start a new project, I recommend that you know most of the tools and techniques that you need to complete the project. 
So if the project requires bar charts, histograms, data sorting, adding new variables, etc, you should already know those skills. You should have learned them with small, simple examples, and practiced them for a while so that you’re “smooth” at executing them. However, even if you’ve learned and practiced the required tools, when you dive into your project, you’ll begin to find little gaps. You’ll find things that you don’t know quite as well as you thought you did. You’ll discover that maybe you don’t know a particular function that well. Or you’ve forgotten a critical piece of syntax. This is gold. When you work on a project, these “missing pieces” tell you what you need to work on in order to get you to the next level. Let me give you an example: when you’re starting out with ggplot2 , I recommend that you learn 5 critical data visualizations : the bar, the line, the scatter, the histogram, and the small multiple. These comprise what I sometimes call “the big 5” data visualizations. These are the essentials. After learning these, let’s say that you decide to work on a project. You decide to analyze a small dataset that you obtained online, and you plan to use the “essential visualizations.” But after creating the basic visualizations to analyze the data, you decide that you want to make them look a little more polished by modifying the plot themes. If, at that point, you haven’t learned ggplot2::theme() and all of the element functions (like element_line() element_rect() , etc) then you’ll have a hard time formatting your plots and making them look more professional. In this case, you will have identified a “skill gap.” These are next skills to work on. You’d know that to get to the next level, you need to learn (and practice!) the theme() function and the accompanying functions & parameters of the ggplot2 theme system. Projects are excellent for identifying your skill weaknesses. That will help you refine your learning plan as you move forward. HOW TO CHOOSE A GOOD DATA SCIENCE PROJECT TO PRACTICE DATA SCIENCE To get the benefits from project work, the critical factor is selecting a project that’s at the right skill level: not too hard, but not too easy. If you’ve selected well, then you’ll have a small and manageable list of “things to learn right now” in order to finish the project. If you’ve selected a project at an appropriate skill level, then your “skill gap” will be small, you’ll be able to learn those new skills on the fly, and you’ll be able to complete the project. Afterwards, you’ll be able to add these “new skills” to your practice routine so you can remember them over time . Choosing such a project is more of an art than a science, but here are a few pointers: CHOOSE SOMETHING THAT YOU THAT’S MOSTLY WITHIN YOUR CURRENT SKILL LEVEL Ultimately, you want something that’s within your skill level, but will push you just a little bit. Having said this, when you consider a new project, you should just ask a few simple questions: 1. What skills do I think I’ll need? 2. Do I know those skills? Here’s an example: about a year ago, I did an analysis of a car dataset that I obtained online. Before starting this project, I had a good idea of the tools that I’d need: * Bar chart * Line chart * Histogram * Small multiple * Joining datasets * Adding variables to a dataset There were a few other tools and techniques, but that’s the short list. Before I even started the project, I had a rough idea that those were the skills that I needed to know. 
If you wanted to execute a similar project, you should make a similar list, and ask yourself, do I know most of these skills already? YOU SHOULD KNOW HOW TO DO ABOUT 90% OF THE WORK After identifying the tools and techniques you’ll need for a project, here’s a good rule of thumb: you should already know about 90 percent of the tools and techniques. For example, if you’re working on a project that requires about 20 primary tools or techniques, you should be able to execute roughly 18 of those techniques. That means that there would be about 2 – 4 techniques that you didn’t know. Such a project would be a decent stretch. For the 18 techniques that you do know, it will be good practice. You’ll get to repeat those techniques (repetition is essential for long-term memory) and perhaps combine them into new or interesting ways. What about the techniques that you don’t know? You’ll have to learn them on the fly and integrate them into the project. This is actually hard to do, because learning a new technique will slow you down. Learning a new technique while you’re working on your project will dramatically reduce your effectiveness and slow down the project’s progress. That’s why I recommend that you mostly learn and practice techniques outside the context of project work. To rapidly learn and master your tools , should be learning and practicing your toolkit regularly and separate from your projects. But again, if you begin a project and realize that there are a few necessary techniques that you don’t know, that’s fine. In fact, it’s good. It tells you what your next steps are for your learning plan. This invites a question though: What counts as a technique? I actually think that the tidyverse’s modular structure gives us a good way of breaking things down. Individual tidyverse functions are a good way to dissect the project into different tools. In this scheme, I’d consider dplyr::mutate() to be one tool. dplyr::arrange() would be another. Among the ggplot2 techniques, you could consider geom_line() to be a single technique. Some of the intermediate tools like scale_fill_gradient() could also be considered separate techniques. Again, the tidyverse is highly modular, in that each function is a little functional module that does one thing. That being the case, you can treat these little, modular functions as units that you either know or don’t know when you evaluate a potential project. So to restate, here’s a good rule of thumb: when you start a project, you should already know about 90% of the techniques (and the remaining 10% will force you to stretch your skill). IF IT FEELS TOO EASY, CHOOSE SOMETHING HARDER Having said that, if you evaluate a project, and it seems too easy, then try to find something harder. You want to push yourself just a little. For example, if you’ve been a data scientist for a year or two, and you’ve made a few hundred bar charts and line charts, then choosing a project that uses only the basic tools might be little too easy for you. If that’s the case, try to find something that is just a little out of our comfort zone. Again, it’s like weight lifting: you need to add a little weight to the bar every week in order to get strong. If the weight on the bar is so easy that you can do a couple dozen repetitions, it’s too light. You need something more difficult. If you look at a potential project, and you know you’ve done something very similar many times before, choose something more difficult. 
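Before wrapping up, here is a minimal practice sketch of the kind of small, modular drills described above: one tidyverse function at a time, each on a tiny, familiar example, followed by one simple integration of two tools. The mtcars dataset is only a stand-in here (it is not from this post); any small dataset you know well works just as well.

library(tidyverse)

# mtcars is a stand-in dataset (an assumption), not data from the original post

# drill 1: add a new variable with mutate()
mtcars_2 <- mtcars %>% mutate(weight_kg = wt * 453.6)

# drill 2: sort with arrange()
mtcars_2 %>% arrange(desc(weight_kg))

# drill 3: a basic bar chart of car counts by cylinder
ggplot(data = mtcars_2, aes(x = factor(cyl))) +
  geom_bar()

# integration: filter() piped straight into a plot
mtcars_2 %>%
  filter(weight_kg > 1500) %>%
  ggplot(aes(x = weight_kg, y = mpg)) +
  geom_point()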
PROJECTS ARE PART OF A LARGER PROCESS OF SYSTEMATIC LEARNING If you use projects the right way, then they are a critical part of a much larger scheme of highly systematic learning. In this post, I dropped some hints, but here I'll be more explicit: to rapidly learn and master data science, you need to be systematic. You need to be systematic in what you learn, when you learn it, and how you practice. High performers of all stripes know that relentless, systematic practice is the most effective way to learn a new skill. Having said that, as I mentioned above, projects are an important part of a systematic learning plan because they help you integrate what you've already learned, they help you identify skill gaps, and they can push you beyond your comfort zone. But whatever you do, don't fall into the “jump in and build something” trap by trying to learn data science without a plan.
","Projects can be great for mastering data science, but you have to choose your projects carefully. This article will give you tips on how to choose a project that's appropriate for your skill level (and tell you some pitfalls to watch out for). For more data science tutorials, sign up for our email list.",How to choose a project to practice data science,Live,101
271,"HOW TO EASE THE STRAIN AS YOUR DATA VOLUMES RISE September 14, 2017 | Written by: Manish Bhide

Ever had to make a decision when you didn't have the time, means or patience to look up all the data that could help you choose the best option? Yes, well, you're not alone on that score. Usually, this doesn't have significant or long-lasting consequences — does it really matter if you choose where to go for dinner because you like the look of a place, rather than combing through recent reviews? But some decisions carry a lot more weight.
For example, executives at Kodak decided not to pursue the digital camera technology that their employees invented, giving arch-rivals Fuji and Sony a golden opportunity to seize market share that they were never able to claw back. For some time now, the party line has been that big data could have saved these organizations and countless others from bad decisions. But that isn't the whole story. As my colleague Jay Limburn shared in a previous blog post, having lots of information — particularly when it is poorly organized, difficult to find or not fully trusted — can hold you back just as much as not having enough data.

SOLVING THE SCALABILITY CONUNDRUM We all know how important scalability is when building an infrastructure that can cope with big data — the clue is in the name ‘big data’! But how do you actually achieve scalability that delivers service continuity as your data grows? First, you need to take some key considerations into account. Scalability isn't just about coping with gigabytes of data that grows to terabytes, petabytes, exabytes, zettabytes, yottabytes and beyond. It's also about dealing with increasing numbers of data sets, formats and types. For that, you'll need to make sure that the tools that help your knowledge workers make sense of data and manage governance policies can scale up too, or you'll soon be in trouble. You can scale data infrastructure vertically, by adding resources to existing systems, or horizontally, by adding more systems and connecting them so you can load balance across them as a single logical unit. Vertical scaling is limited, because you will eventually reach the maximum capacity of your machine. In contrast, horizontal scaling may take more planning but presents far fewer restrictions.

SO, WHAT'S THE ANSWER? The best approach is multi-faceted: give knowledge workers access to lots of data, along with the tools they need to quickly find the most relevant assets without violating governance policies along the way. Of course, this is easier said than done. But with data management tools that include built-in cataloging — such as IBM's new IBM Data Catalog solution — you will be able to quickly search for data both within and across extremely large sets. As an example, if one of your data scientists discovers a relevant data set when researching a topic, they will be able to add tags and descriptions to make it easier for other data workers to find it when working on similar problems or questions. As more people add to the metadata, it will become increasingly easy for data scientists to gather the information they need through keyword searches. In addition to its cataloging capabilities, IBM Data Catalog will also feature a business glossary, to help users tackle the challenges of continually evolving terminology. Different people refer to different things in different ways, which can prevent knowledge workers from finding relevant data sets, a problem that only gets worse as organizations and their data get larger. A business glossary will enable you to establish a consistent set of terms to describe your data, so that knowledge workers can quickly understand which assets are useful and which are irrelevant to their analyses. Users will also be able to take advantage of an auto-discovery service.
It will trawl through their systems to find available data sources, work out the types and formats of data in each, and present them to the data user, who can then choose which to publish in the catalog. It doesn’t stop there — through auto-profiling, the solution will be able to automatically classify data, figuring out whether it contains social security numbers, names, addresses, zip codes, or other common types of data. As discussed in more detail in another previous blog post , IBM Data Catalog will also offer automated, real-time classification and enforcement of governance policies. This is currently a unique proposition, and resolves one of the major obstacles to scaling up the size and use of data management systems. Automated governance will remove the need for the Chief Data Officer (CDO)’s team to manually enforce governance policies, avoiding scalability issues as the number of data assets grows. Moreover, the governance dashboard will offer CDOs an aggregated view of enforcement across an organization, including requests for access and usage of assets. The scale and complexity of governance efforts usually grow alongside companies and their data, so these tools will represent a real game-changer in the building and use of data management systems. AND WHAT WILL HAPPEN BEHIND THE SCENES? Delivered via the cloud, IBM Data Catalog will give users the chance to no longer worry about scaling infrastructure. But let’s take a look behind the curtain to understand a few of the ways IBM will ensure seamless services, even when demand suddenly spikes. The IBM cloud provides load-balancers that can automatically distribute workload between the available application nodes, avoiding bottlenecks when one node gets busy. The cloud platform can also automatically scale horizontally, spinning up new nodes to deal with more data when demand exceeds a certain threshold. The result is that the user can enjoy stable response times, with little to no degradation of performance even during busy periods. For added resilience, nodes can also be deployed across data centers in multiple availability zones, protecting service continuity in the event of an outage at one location. The same scalability and resilience are provided at the storage layer, too. All Data Catalog metadata is stored in IBM Cloudant, where it is auto-replicated across nodes. This replication avoids the risk of having a single point of failure, helping to keep the Catalog available even in the event of a node failing. And for customers who choose to use Data Catalog not only as a metadata store, but also as a repository for the data itself, the solution harnesses IBM Cloud Object Storage to provide massive scalability for any volume of data. Finally, behind the scenes, IBM will have a team of specialists monitoring your infrastructure to catch and address any potential scalability issues. Utilizing the best of IBM’s technology to analyze and track key performance indicators such as CPU and memory usage, they will be notified of any emerging problems so they can take action before users feel the impact. LOOKING AHEAD In summary, Data Catalog has been engineered from both a functional and non-functional perspective to solve the real problems posed by scaling big data architectures. 
Instead of focusing purely on the storage of the data itself, Data Catalog addresses practical issues such as findability, usability and governance — helping you not only preserve and organize your data, but also allow users and data stewards to work with it more effectively. Learn more about IBM Data Catalog today -------------------------------------------------------------------------------- Originally published at www.ibm.com on September 14, 2017. * Data Management * Data Catalog * Cloud Services * Big Data A single golf clap? Or a long standing ovation?By clapping more or less, you can signal to us which stories really stand out. Blocked Unblock Follow FollowingSUSANNA TAI Offering Manager, Watson Data Platform | Data Catalog FollowIBM WATSON DATA PLATFORM Build smarter applications and quickly visualize, share, and gain insights * * * * Never miss a story from IBM Watson Data Platform , when you sign up for Medium. Learn more Never miss a story from IBM Watson Data Platform Get updates Get updates","Ever had to make a decision when you didn’t have the time, means or patience to look up all the data that could help you choose the best option? Yes, well, you’re not alone on that score. ",How to ease the strain as your data volumes rise,Live,102 272,"* R Views * About this Blog * Contributors * Some Resources * * R Views * About this Blog * Contributors * Some Resources * R FOR ENTERPRISE: HOW TO SCALE YOUR ANALYTICS USING R by Sean Lopp At RStudio, we work with many companies interested in scaling R. They typically want to know: * How can R scale for big data or big computation? * How can R scale for a growing team of data scientists? This post provides a framework for answering both questions. SCALING R FOR BIG DATA OR BIG COMPUTATION The first step to scaling R is understanding what class of problems your organization faces. At RStudio, we think of three use cases: data extraction, embarrassingly parallel problems, and analysis on the whole. Garrett Grolemund hosted an excellent webinar on Big Data in R , in which he outlined the differences in these three cases. DISCLAIMER: These three cases are not exhaustive, nor are most problems easily categorized into one of the three classes. But, when scoping a scaled R environment, it is imperative to understand which class needs to be enabled. Your organization might have all three cases, or it might have only one or two. CASE 1: COMPUTE ON THE DATA EXTRACT Example: I want to build a predictive model. I only need a few dozen features and a three-month window to build a good model. I can also aggregate my data from the transaction level to the user level. The result is a much smaller data set that I can use to train my model in R. Computing on data extracts is arguably the most common use case; an analyst will run a query to pull a subset of data from an external source into R. If your data extracts are large, you can run R on a server. At RStudio, we recommend using the server version of the IDE (either open-source or professional), but there are many ways to use R interactively on a server. CASE 2: COMPUTE ON THE PARTS Example: When I worked at a national lab (NREL), we validated fuel economy models against real-world datasets. Each dataset had hundreds of recorded trips from individual vehicles. While the total dataset was TBs, each individual trip was a few hundred MBs. We ran independent models in parallel against each trip. Each of these jobs added a single line to a results file. 
Then we aggregated the results with a reduction step (taking a weighted mean). By using an HPC system, a task that would take weeks to run sequentially was completed in a few hours. Compute on the parts happens when the analyst needs to run the same analysis over many subsets of data, or needs to run the same analysis many times, and each model is independent of the others. Examples include cross validation, sensitivity analysis, and model scoring. These problems are called: “embarrassingly parallel” (often a misnomer, since scaling for embarrassingly parallel problems is rarely embarrassingly simple). COMPUTE ON THE PARTS WITH A SINGLE MACHINE By default, R is single threaded; however, you can also use R packages to do parallel processing on a multicore server or a multicore desktop. Local parallelization is facilitated by packages like parallel, snow, foreach, etc. These packages parallelize your R commands by running them on independent threads in multicore processors. Alternatively, low-level parallelization can be facilitated with packages like Rcpp and RcppParallel. These packages facilitate the interaction of R with C++. COMPUTE ON THE PARTS WITH A HIGH PERFORMANCE CLUSTER (HPC) In some cases, R users have access to High Performance Computing environments. These environments are becoming more readily available with technologies like Docker Swarm. An R user will test R code interactively (on an edge node or their local machine), and then submit the R code to the cluster as a series of batch jobs. Each batch job will call R on a slave node. Note that RStudio, as an interactive IDE, may run on an edge node of the cluster or on a local machine. RStudio does not run on the slave nodes. Only R is run on the slave nodes and is executed in batch (not interactively). One challenge faced by R users is knowing how to submit batch jobs to the cluster, tracking their progress, and re-running jobs that fail. One solution is the batchtools package. This package abstracts the details of job submission and tracking into a series of R function calls. The R functions, in turn, use generic templates provided by system administrators. Parallel R with Batch Jobs provides a nice overview. Some analysts have created Shiny applications that leverage these functions to provide an interactive Job Management interface from within RStudio! One challenge faced by system administrators is ensuring the dependencies for the batch R script are available on all the slave nodes. Dependencies include: data access, the correct version of R, and correct versions of R packages. One solution is to store the R binaries and package libraries on shared storage (accessible by every slave node), alongside shared data and the project’s read/write scratch space. Case 2: Compute on the parts. Technologies: parallel, snow, RcppParallel, LSF, SLURM , Torque , Docker SwarmCASE 3: COMPUTE ON THE WHOLE Example: A recommendation engine for movies that is robust to “unique” tastes. The entire domain space needs to be considered all at once. Image classification falls into this class; the weights for a complex neural network need to be fit against the entire training set. This class of problem is the most difficult to solve, and has generated the most hype. Sometimes analysts will purchase, use, and modify ready-made implementations of these algorithms. Computing on the whole happens when the analyst needs to run a model against an entire dataset, and the model is not embarrassingly parallel or the data does not fit on a single machine. 
Typically, the analyst will leverage specialized tools such as MapReduce, SQL, Spark, H20.ai, and others. R is used as an orchestration layer. Orchestration involves using R to run jobs in other languages. R has a long history of orchestrating other languages to accomplish computationally intensive tasks. See Extending R by John Chambers. When orchestrating a case 3 problem, the R analyst will use R to direct an external computation engine that does the heavy lifting. This approach is very similar case 1. For example, Oracle’s Big Data Appliance and Microsoft SQL Server 2016 with R Server both include routines for fitting models in the database. These routines are accessible as specialized R functions. These functions are used in addition to case 1 extracts created with traditional SQL queries through RODBC or dplyr. Another example is Apache Spark. The R analyst will work from an edge node running R. (The open-source or professional RStudio Server can facilitate this interactive use.) In R, the user will call functions from a specialized R package, which in turn accesses Spark’s data processing and machine learning routines. One available R package is sparklyr. Note that the machine learning routines are not running in R. The analyst uses these routines as black boxes that can be pieced together into pipelines, but not modified directly. Case 3: Compute on the whole. Technologies: Hadoop, Spark, Tensorflow, In-DB computing (RevoScaleR, OracleR, Aster, etc)MULTIPLE USERS: SCALING R FOR TEAMS As organizations grow, another concern is how to scale R for a team of data scientists. This type of scale is orthogonal to the previous topic. Scaling for a team addresses questions like: How can analysts share their work? How can compute resources be shared? How does R integrate with the IT landscape? In many cases, these questions need to be answered even if the R environment doesn’t need to scale for big data. Scaling R for teams. Technologies: Version control (Git, SVN), miniCRAN, RStudio Server ProOpen-source packages can address many of these concerns. For example, many organizations use packrat and miniCRAN to manage R’s package ecosystem. The use of version control become increasingly important as teams grow and work together. Many companies will create internal R packages to facilitate sharing things like data access scripts, ggplot2 themes, and R Markdown templates. Airbnb provides a detailed example . For more information on version control, packrat, and packages, see the webinar series RStudio Essentials . At RStudio, we recommend using RStudio Server Pro because its features such as load balancing, multi-session support, collaborative editing and auditing are designed specifically to support a large numbers of user sessions. WRAP UP Whether you need to compute on big data, grow your analytic team, or do both, R has tools to help you succeed. As more companies look to data to drive business decisions, creating a scaleable R environment will be a critical step towards success. Many of the topics in this blog deserve their own posts. However, understanding and discussing these different types of scale can help create the correct roadmap. If you’ve created an R environment at scale, we’d love to hear from you. In a later post, we’ll address another outstanding question: after I scale the R platform, how do I scale the distribution of results and insights to non-R users? 
","by Sean LoppAt RStudio, we work with many companies interested in scaling R. They typically want to know:How can R scale for big data or big computation?How can R scale for a growing team of data scientists?This post provides a framework for answering both questions.Scaling R for Big Data or Big ComputationThe first step to",How to Scale Your Analytics Using R,Live,103 274,"THIS WEEK IN DATA SCIENCE (NOVEMBER 15, 2016) Posted on November 15, 2016 by cora Here’s this week’s news in Data Science and Big Data. Don’t forget to subscribe if you find this useful! INTERESTING DATA SCIENCE ARTICLES AND NEWS * What Artificial Intelligence Can and Can’t Do Right Now – Lately the media has sometimes painted an unrealistic picture of the powers of AI. * Can deep learning help solve lip reading? – New research paper shows AI easily beating humans, but there’s still lots of work to be done. * Trump, Failure of Prediction, and Lessons for Data Scientists – The shocking and unexpected win of Donald Trump of presidency of the United States has once again showed the limits of Data Science and prediction when dealing with human behavior. * Understanding the Four Types of Artificial Intelligence – Machines understand verbal commands, distinguish pictures, drive cars and play games better than we do. How much longer can it be before they walk among us? * Google DeepMind’s AI learns to play with physical objects – Push it, pull it, break it, maybe even give it a lick. Children experiment this way to learn about the physical world from an early age. Now, artificial intelligence trained by researchers at Google’s DeepMind and the University of California, Berkeley, is taking its own baby steps in this area. * IBM’s Watson to use genomic data to defeat drug-resistant cancers – The five-year, $50 million project will study thousands of drug-resistant tumors. * A Day in the Life of a Data Engineer – This post is part of our Day in the Life of Data series, where our alumni discuss the daily challenges they work on at over 200 companies. * Machine-Learning Algorithm Quantifies Gender Bias in Astronomy – Calculation suggests papers with women first-authors have citation rates pushed down by 10 percent. * Delivering real-time AI in the palm of your hand – As video becomes an even more popular way for people to communicate, we want to give everyone state-of-the art creative tools to help you express yourself.
* Six Data Science Lessons from the Epic Polling Failure – Big data analytics suffered a huge setback on Tuesday when nearly every political poll failed to predict the outcome of the presidential election. * Deep learning is already altering your reality – If we’re living in an algorithmic bubble, we should know how it’s bending and coloring whatever rays of light we’re able to glimpse through it. * How IBM Watson May Help Solve Cancer Drug Resistance – We may soon know how cancer dodges powerful drugs and becomes resistant to them. * How to approach machine learning in the cloud – Machine learning needs lots of data, and the best place for all that data and the systems that use it is in the cloud. UPCOMING DATA SCIENCE EVENTS * Data Analysis with Spark – Come learn how to work with Big Data using Apache Spark on November 17th. * Apache Spark – Hands-on Session – Come join speakers Matt McInnis and Sepi Seifzadeh, Data Scientists from IBM Canada as they guide the group through three hands-on exercises using IBM’s new Data Science Experience to leverage Apache Spark. * Analytics Strategies in the Cloud – Join IBM and 2-time Canadian Olympic gold-medalist Alexandre Bilodeau for a complimentary event in Montreal where you’ll network, eat, drink and engage in an inspiring discussion on making business analytics easier and more available for all departments throughout your company. * Self service analytics in a flash with dashDB – Join this IBM Webinar on November 17th to learn about self service analytics. * The DNA of a Data Science Rock Star – Join us on November 29th for this latest Data Science Central Webinar and learn what skills, tools, and behaviors are emerging as the DNA of the Rock Star Data Scientist. SHARE THIS: * Facebook * Twitter * LinkedIn * Google * Pocket * Reddit * Email * Print * RELATED Tags: analytics , Big Data , data science , events , weekly roundup -------------------------------------------------------------------------------- COMMENTS LEAVE A REPLY CANCEL REPLY * About * Contact * Blog * Community * FAQ * Ambassador Program * Legal Follow us * * * * * * * Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.","Our fortieth release of a weekly round up of interesting Data Science and Big Data news, links, and upcoming events.","This Week in Data Science (November 15, 2016)",Live,104 275,"BUILDING OFFLINE-FIRST, PROGRESSIVE WEB APPS -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Glynn Bird 11/8/16Glynn Bird Before joining IBM Cloud Data Services, Glynn served as the Head of IT and Development for Central Index, creating a white-label frontend for a NoSQL business directory (using PHP, Node.js, MySQL, Redis, Cloudant, and Redshift). His experience includes writing CRM systems, ""find my nearest"" indexes, e-commerce platforms, and a phone… Learn More Recent Posts * Building Offline-First, Progressive Web Apps In this article, I aim to summarise Progressive Web Apps and provide recommendations from my… * Plug into the Cloudant Node.js Library v1.5 Today marks version 1.5 of the Cloudant Node.js Library. 
The library comes with a new… * Importing JSON Documents with nosqlimport Introducing nosqlimport, an npm module to help you import comma-separated and tab-separated files into your… I’ve been creating websites for many years and I’ve watched the definition of “best practice” evolve over time. Web technology is a movable feast driven by: * Web users who consume the websites being built * Web developers who are tasked with building websites using the tools available * Browser developers who introduce new features into their products that developers can utilise * Standards committees who attempt to gain consensus between all the interested parties so that innovation happens in a way that is mutually beneficial Inevitably there are casualties along the way: standards or browser innovations that show promise but are little-used, fail to gain cross-platform consensus or are superseded by another round of innovation. “BEST PRACTICE IN A BOTTLE, YEAH” In this blog post I aim to summarise Progressive Web Apps (PWAs) , which seem to me to form a manifesto of best practices for the websites of today (November 2016.) The recommendations herein stem from my experience refactoring one of my apps in the summer of 2016. I hope they help you get started with your own PWA implementation. It’s important to note that this blog will only have a limited shelf life. In a year or two, the advice set out here will be out-of-date, perhaps laughably so, but that’s the nature of the beast. It’ “best practice” rolls onwards and the very programming language we use to pin it all together changes radically. WHAT ARE PWAS? The term “Progressive Web App” refers to a website that aims to provide a user experience akin to a native app. The “Progressive” bit refers to the web app selecting which technologies it engages depending on the capabilities of the platform the website is running on. On an older browser, a PWA may not have any special features, but on the latest Firefox or Chrome builds they may silently enable the modern APIs that those platforms afford. Simply put, a PWA aims to provide: * Responsive design – that displays well on mobile, tablet and desktop form factors * Offline rendering – where the web page can be viewed and used with no network connection * Offline-First storage – where data is stored locally on the device and synced to the cloud later * App-Like install – where a mobile user can save the web app to their desktop Pokedex.org by Nolan Lawson runs offline and can install to your phone like a normal app. See this write-up or the source on GitHub. Some of these aims are not new, but the PWA manifesto brings together this shopping list of best practice and offers APIs that web developers can use today. Many of the aims are technology-neutral and can be solved using a variety of tools. RESPONSIVE DESIGN There are any number of CSS frameworks that dictate the markup you can use to achieve a fluid, collapsible web interface that looks good on all devices. I have used Bootstrap for years but for this blog post, and in the interest of variety, I chose the Materialize library instead. Incorporating Google’s Material Design principles, Materialize makes it very simple to create a good-looking, responsive website that works well on mobile devices. OFFLINE RENDERING A standard website won’t function at all if there’s no network connection. Even if the network connection is patchy, such as when browsing on a mobile device, a site may struggle to deliver a satisfactory user-experience. 
The AppCache API allows websites to be aggressively cached on the device to the point where they can render with no network connectivity (as long as they were visited at least once on a previous occasion!). The AppCache API is an example of a solution that was designed by committee, received widespread browser adoption but was not widely loved by developers. It has been superseded by the Service Worker API . Service Workers are JavaScript tasks (a bit like server-side daemons but running on the client side) that are instantiated by web pages and from that point, can intercept and route traffic emanating from that page. The Service Worker API is much more flexible than AppCache as it allows the developer to decide in minute detail what happens to each client-side web request — but with flexibility comes complexity. OFFLINE-FIRST STORAGE Offline-First storage allows data to be stored in an in-browser database, giving your web application the opportunity to read and write data to and from its local database, even when offline. There are several solutions to this problem. IndexeDB , DexieJS , and SQLite are supported by a range of browsers, but my favourite in-browser database is PouchDB , which works on a wide variety of browsers and devices and provides the same API to you (the developer) while choosing the best in-browser storage technology at runtime. Making a website work on a range of browsers and platforms is hard enough, but in-browser storage varies greatly from browser to browser, and PouchDB smooths the path immeasurably. Wrote some code with @pouchdb today. Soooo easy, sooo simple. — Simon /\/\e†s0|\| (@drsm79) September 19, 2016 PouchDB also allows the in-browser database to be synced to a remote Apache CouchDB™ , IBM Cloudant or PouchDB database when there is network connectivity using the CouchDB replication protocol. The ability of CouchDB-like databases to allow the same data to be replicated, modified in different ways and re-synced without data-loss makes this an ideal solution for offline-first storage. APP-LIKE INSTALL Progressive Web Apps are not installed from an app-store like native apps; they are shared using URLs as the Web’s design intends. Once loaded on a phone’s browser, the URL can be added to the phone’s home screen, but implementations of this functionality vary between browsers and platforms. Google Chrome supports a manifest.json file that lists the application’s name, colours, icons and other metadata. BUILDING A PWA While I can’t share all the source code, I will include some snippets from my work refactoring my app earlier this summer. Here’s the toolkit I chose to produce the PWA features I was after: * Cloudant Envoy – to allow my one-database-per-user model to result in a single database on the server side * MaterializeCSS – for responsive CSS and markup. Other frameworks are available, of course. * jQuery – I’m not a full-time front-end developer. I understand jQuery, and I haven’t the time to learn one of the formal frameworks like Angular or ReactJS. * PouchDB – for in-browser storage and sync * LeafletJS – for maps and HTML5 geolocation * Mustache – for HTML templating * Simple Data Vis – absurdly simple visualisation library based on d3 The range of choices is bewildering. This list doesn’t represent the only way to build a PWA by any means, but it’s the tooling I was comfortable using. Don’t be overwhelmed! Here’s where I started with my PWA. I found it easiest to start with the front end in my app. 
I wrote my front end code assuming that the user in my application was authenticated and by hard-coding a few settings. Then I wrote my front end app to read and write its data from its local PouchDB database. I knew that with a few more lines of code I could get it to sync correctly, so that “solved problem” wasn’t one I needed to waste time on. If I could get an app to allow data to be added, edited and deleted on the client side, then the rest should fall into place. I also ignored the offline caching code until the last minute too. I assumed (correctly) that if I got my app working then I could add the Service Worker to provide a caching service at a later date. GETTING STARTED WITH CLOUDANT ENVOY In your blank directory create a new “package.json” file with: > npm init We can then add Cloudant Envoy : > npm install --save cloudant-envoy We are going to put our static website (index.html, JavaScript, CSS, images, etc.) in a “public” sub-directory and our Node.js app in “app.js”: > mkdir public > touch public/index.html > mkdir public/js > mkdir public/css > touch app.js Create your app in app.js : var path = require('path'), express = require('express'), router = express.Router(); // my custom API call router.post('/myapicall', function(req, res) { res.send({ok: true}); }); // setup Envoy to // - log incoming requests // - switch off demo app // - serve out our static files // - add our routes var opts = { logFormat: 'dev', production: true, static: path.join(__dirname, './public'), router: router }; // start up the web server var envoy = require('cloudant-envoy')(opts); envoy.events.on('listening', function() { console.log('[OK] Server is up'); }); The above code uses Envoy to start the web server and adds in: * Our “public” directory to be served out * Our custom API calls to be incorporated This design allows us to build a website that is static web server, handles API calls, and is a CouchDB-compatible replication target all in one go. In the client-side code, the app then uses PouchDB to create a database: var db = new PouchDB('mylocaldatabase'); That PouchDB database can then be used to store data: var mydata = { a:1, b:2, c: 'three'}; db.post(mydata).then(function(d) { console.log('Data saved to', d.id); }); When you need to sync the data, simply use the PouchDB replicate or sync tools: var remotedb = new PouchDB('https://username:password@mywebserver.myhost.com/envoy'); db.sync(remotedb); The URL you sync to depends on where your app is running. It could be https://username:password@myapp.mybluemix.net/envoy or http://localhost:8000/envoy . The database name (after the last slash) has to match the one that your app is using ( envoy is the default db name). CREATING USERS WITH ENVOY By default, Cloudant Envoy looks for users in its envoyusers database. Here’s what a user object looks like: { ""_id"": ""user123"", ""_rev"": ""1-89de8ebc2b1ad4385ced1f0ed29fa708"", ""type"": ""user"", ""name"": ""user123"", ""roles"": [], ""username"": ""user123"", ""password_scheme"": ""simple"", ""salt"": ""1d5d80c9-d925-4f1e-8114-ed44501c38a5"", ""password"": ""4809dcd4f8dd1cf16f592d90d518875d3c5916f8"", ""seq"": null, ""meta"": { ""user_name"": ""johnsmith"", ""facebook_id"": ""johnsmith88"", ""premium"": true } } Envoy can create users for you. 
In your code, simply call: var username = 'user123'; var password = 'mysecretpassword'; var meta = { ""user_name"": ""johnsmith"", ""facebook_id"": ""johnsmith88"", ""premium"": true }; Once added, the username-password combination should work for replication too. LOCAL DOCUMENTS If you need to store state locally that you don't want to be replicated to the remote replica, then simply store data to a document whose _id begins with _local/ , e.g.: var localstate = { _id: '_local/mystate', a:1, b:2}; db.put(localstate); Local documents are only stored on the device and are not included in the list of documents to be copied during replication. OFFLINE MAPS The Leaflet JavaScript library is easy enough to cache so that it works offline, but the map tiles themselves are pretty tricky: there's lots of them at lots of resolutions. The solution I developed was to use an empty map and add a GeoJSON layer that contained a rough outline of the world. For my application, I only need to geo-locate users approximately, and I didn't need every road, river and hill to be rendered on the map. To render the map, I created a Leaflet map: var mymap = L.map('mapid').setView([20, 0], 1); Then, I fetched the 250k GeoJSON file and rendered it on top: $.ajax({url: '/js/world.json', success: function(data) { var style = { color: ""#666"", fillColor: ""#66bb66"" }; L.geoJson(data, {style: style}).addTo(mymap); }}); If we cache the Leaflet CSS & JavaScript files together with the world.json file referenced in the snippet, then we have offline-first maps! CONCLUSION Progressive Web Apps give users a vastly improved experience when used with modern browsers: * The same app can be used on desktop and mobile browsers * Data is stored and retrieved from a local data set, so performance and battery life are excellent * Site assets can be cached locally, making the app available despite the network connection status * Apps can be distributed through URLs without app store submission and installation with much smaller application size Compliments? Complaints? Mild salutations? Direct them to @glynn_bird , and don't forget to have a look at PouchDB , Cloudant Envoy and the other tools here for your next Progressive Web App.",A summary of Progressive Web Apps and recommendations on refactoring code to use offline-first storage and other aspects of PWAs.,"Building Offline-First, Progressive Web Apps",Live,105 283,"DATA VISUALIZATION PLAYBOOK: USING FUNCTION TO DRIVE DESIGN November 24, 2015 by Jennifer Shin Topics: Big Data Technology Tags: big data, data analytics, data science, data scientist, data visualization, visualizations Data scientists must be selective when choosing what type of visualization to use with a data set. In particular, should we select a visualization before working with the data, or should the same type of visualization always accompany a particular type of data? To decide, let's explore another, more foundational question: Which comes first—the data, or the visualization? PUTTING A FACE ON DATA Consider a nonprofit organization that wishes to create a data visualization for an upcoming report.
The members of the board decide to create a visualization depicting the distribution of funding for all initiatives, across eight different types of projects. To do so, they commission a graphic designer to create a distinctive icon to represent each project. To emphasize the icons, the new visualization arranges them in the circular format shown in Figure 1. The icons are also ranked by amount of funding received, with each icon sized to scale. Figure 1: The distribution of funding for projects across the organization's major initiatives. IDENTIFYING THE PROBLEM However, for several reasons, the new visualization failed to accomplish the board's full purpose in creating it. ISSUE 1: THE MORE-IS-BETTER APPROACH Noticing that the visualization did not highlight information about the organization's two most important initiatives, health and education, the board commissioned two additional visualizations for only those initiatives in the same format, as shown in Figure 2a. Figure 2a: Two additional visualizations were created for the health and education initiatives, using the same format. ISSUE 2: FUNCTIONAL LIMITATIONS Each individual figure accomplished the board's objectives in commissioning it, using specially designed graphics to display the amount of funding for each project category. However, the visualizations proved less effective than expected and did not effectively communicate information to readers. Figure 2b: The three visualizations appeared separately in the report, making comparison of initiatives difficult. Once inserted into the report, each visualization filled half a page. Moreover, because the figures were separated by pages of text, comparison required readers to flip between visualizations. What's more, the images' visual similarity led readers astray, creating the impression that the visualizations represented similar data sets. However, the visualizations for the health and education initiatives were merely spotlights on two important portions of the whole amount, whereas the first visualization depicted the total amount for all initiatives—including the amounts broken down in the other two visualizations. Accordingly, some readers did not understand that the amounts given in the first visualization also included the amounts shown in the other two, a problem compounded because the largest icon in each visualization was sized the same as the largest icon in each other—yet represented a different dollar amount. Indeed, although two visualizations focused on health and education, no visualization similarly highlighted the remaining initiatives—employment, cultural and social. ITERATING TOWARD A SOLUTION When the board attempted to address the issue, several rounds of revisions ensued, each more closely approximating the board's intent. STEP 1: RETHINKING THE DESIGN The icons' circular layout in the initial visualization worked well for a set of icons displayed in a single figure but frustrated comparison of icons across several figures.
Blog Internet of Things data access and the fear of the unknown Blog Spark: The operating system for big data analytics Blog Graph databases catch electronic con artists in the act Blog InsightOut: Metadata and governance Blog New IBM DB2 release simplifies deployment and key management Podcast How is open source transforming graph analytics? Blog What is Hadoop? Blog The rise of NoSQL databasesView the discussion thread.IBM * Site Map * Privacy * Terms of Use * 2014 IBMFOLLOW IBM BIG DATA & ANALYTICS * Facebook * YouTube * Twitter * @IBMbigdata * LinkedIn * Google+ * SlideShare * Twitter * @IBManalytics * Explore By Topic * Use Cases * Industries * Analytics * Technology * For Developers * Big Data & Analytics Heroes * Explore By Content Type * Blogs * Videos * Analytics Video Chats * Big Data Bytes * Big Data Developers Streaming Meetups * Cyber Beat Live * Podcasts * White Papers & Reports * Infographics & Animations * Presentations * Galleries * Events * Around the Web * About The Big Data & Analytics Hub * Contact Us * RSS Feeds * Additional Big Data Resources * AnalyticsZone * Big Data University * Channel Big Data * developerWorks Big Data Community * IBM big data for the enterprise * IBM Data Magazine * Smarter Questions Blog * Events * Upcoming Events * Webcasts * Twitter Chats * Meetups * Around the Web * For Developers * Big Data & Analytics HeroesMore * Events * Upcoming Events * Webcasts * Twitter Chats * Meetups * Around the Web * For Developers * Big Data & Analytics HeroesSearchEXPLORE BY TOPIC:Use Cases All Acquire Grow & Retain Customers Create New Business Models Improve IT Economics Manage Risk Optimize Operations & Reduce Fraud Transform Financial Processes Industries All Automotive Banking Consumer Products Education Electronics Energy & Utilities Government Healthcare & Life Sciences Industrial Insurance Media & Entertainment Retail Telecommunications Travel & Transportation Wealth Management Analytics All Content Analytics Customer Analytics Entity Analytics Financial Performance Management Social Media Analytics Technology All Business Intelligence Cloud Database Data Warehouse Database Management Systems Data Governance Data Science Hadoop & Spark Internet of Things Predictive Analytics Streaming Analytics Blog For strategic planning, business must go beyond spreadsheets Presentation 5 questions to ask when analyzing customer behavior Blog Innovation, inspiration and practical intelligence: IBM Vision 2016 Interactive Cognitive business starts with analyticsMOREBlog For strategic planning, business must go beyond spreadsheets Presentation 5 questions to ask when analyzing customer behavior Blog Innovation, inspiration and practical intelligence: IBM Vision 2016 Interactive Cognitive business starts with analytics Interactive All data all the time: How mobile technology informs travelerhabits Podcast Finance in Focus: Innovative business ideas with Lisa Bodell Blog The secret to enhancing customer engagement Blog For strategic planning, business must go beyond spreadsheets Blog Innovation, inspiration and practical intelligence: IBM Vision 2016 Interactive Cognitive business starts with analytics Podcast Finance in Focus: Innovative business ideas with Lisa BodellMOREBlog For strategic planning, business must go beyond spreadsheets Blog Innovation, inspiration and practical intelligence: IBM Vision 2016 Interactive Cognitive business starts with analytics Podcast Finance in Focus: Innovative business ideas with Lisa Bodell Podcast How is open source transforming 
","Find out why redundant visualizations can turn detail into too much of a good thing, obscuring connections and diminishing contrast.",Data visualization: Function drives design,Live,106 285,"ERIK BERNHARDSSON WHEN MACHINE LEARNING MATTERS 2016-08-05 I joined Spotify in 2008 to focus on machine learning and music recommendations. It's easy to forget, but Spotify's key differentiator back then was the low-latency playback. People would say that it felt like they had the music on their own hard drive. (The other key differentiator was licensing — until early 2009 Spotify basically just had all kinds of weird stuff that employees had uploaded. In 2009 after a crazy amount of negotiation the music labels agreed to try it out as an experiment. But I'm getting off topic now.) Music distribution is a trivial problem now.
Put everything on a CDN and you’re done. The cost of bandwidth and storage has gone down by an order of magnitude, not to mention the labor cost needed to build and maintain it. Anyway, at some point in 2009 we realized that we had far bigger challenges at Spotify than building a music recommendation system. So instead, I switched gears and ran the “Analytics team” for 2 years. We did the first A/B tests, ad delivery optimizations, provided data points crucial to bizdev deals, etc. Not until 2013 did we feel like it was time to focus on music recs. So I switched back and built up a team around that. The feeling was that we already solved the “tablestakes” problems around music distribution and music management. Those problems had become easy to solve for anyone. The next differentiator would be more advanced features that deliver user value and are harder for competitors to copy. So we focused a lot on ML again. Which brings me to this conclusion In the majority of all products, machine learning will not be a key differentiator in the first five years. MOST MACHINE LEARNING IS SPRINKLES ON THE TOP The first few years of product iteration is about getting the “tablestakes” out of the way. The ROI of those are just vastly bigger. I lead the tech team at a startup and we are nowhere near using any kind of sophisticated machine learning, two years into the process. There are a few promising opportunities where we want to use it. I absolutely think it’s going to be a huge competitive advantage for us. But right now far more simpler things matter. Spending a few days working on the conversion funnel is guaranteed do deliver far more business value. Rarely is machine learning the fundamental enabler of a product. It’s often an enhancer . This unfortunately means that the machine learning team isn’t a team that creates the core business value and has a crucial strategic role. It will be the team that comes in after 5-10 years once the “basic” features have been built and then squeezes out another 10% MAU by A/B testing the crap out of the product. Despite the current AI hype, most of the big shops focus on relatively mundane things. Google is trying to get you to click on more ads, Facebook to use the newsfeed more. It’s all incremental improvements on top of a product that already existed for 10 years. Obviousy the image above has nothing to do with this post. I just thought it was funny. Sorry. PICK YOUR COMPETITIVE ADVANTAGE How can we get around this? How can we build a company that’s founded based on machine learning first? I suspect ML in itself is very rarely a competitive advantage. Any machine learning company needs to find a sustainable non-ML advantage. Do you have a fantastic set of image filters? Great, use that tiny head start, launch an app and build a social network. Do you have a really good fraud detection system? Go out and sign up enterprise customers that feed you data back. Machine learning can be a first mover advantage. But there’s a high likelihood whatever insight you have will be independently discovered and published at the next NIPS/KDD/ICML. You need to turn it into something sustainable — having data, or lots of users, or very sticky enterprise contracts, or something else. Besides the core machine learning, other technology can definitely be a competitive advantage. Building super nasty integrations with vendors, or figuring out the control engineering of the suspension system of a self driving car. Those are proprietary assets where there’s little open research. 
For the pure machine learning I think we'll see a separate force of commoditization of machine learning in those areas, where the technological differential between companies converges towards zero. Knowing how to build a convolutional neural network will not be a valuable asset. Hooking it up to a surveillance system and building a video distribution system could be a really key piece of technology. Don't underestimate the power of data. Scraping the web doesn't create a valuable asset. But if you can obtain highly valuable unique data then that's a huge competitive advantage. Another type of data I think people underestimate is in people's heads — learnings from real production usage. E.g. Netflix has iterated movie recommendations for 10 years. They know their shit. It's hard building a better recommender system even if you magically had ten times the data that Netflix has. What seems to happen in reality is that the human capital becomes the real asset. Here's a list of some acquisitions. It's clear to me these acquisitions were 90% acqui-hire — about human capital being redeployed to something else. Google and other big players have shown that they are willing to pay a huge premium for smart teams (throwing out a fun conspiracy theory just for the sake of it: Google is going to acqui-hire any team with smart people just to create a talent monopoly.) These companies all had built some cool tech, but the price paid really represented the scarcity of skills. I expect that scarcity to vanish gradually.",Machine learning is often the enhancer of a product.,When machine learning matters · Erik Bernhardsson,Live,107 285,"IBM-WATSON-DATA-LAB / PIXIEDUST TUTORIAL: USING NOTEBOOKS WITH PIXIEDUST FOR FAST, FLEXIBLE, AND EASIER DATA ANALYSIS AND EXPERIMENTATION va barbosa edited this page Aug 22, 2017 · 7 revisions Interactive notebooks are powerful tools for fast and flexible experimentation and data analysis. Notebooks can contain live code, static text, equations and visualizations. In this lab, you create a notebook via the IBM Data Science Experience to explore and visualize data to gain insight. We will be using PixieDust, an open source Python notebook helper library, to visualize the data in different ways (e.g., charts, maps, etc.) with one simple call. OVERVIEW In this tutorial, you will be learning about and using: * IBM Data Science Experience (DSX) * Jupyter Notebooks * PixieDust * Las Vegas Open Data The tutorial can be followed from a local Jupyter Notebook environment. However, the instructions and screenshots here walk through the notebook in the DSX environment.
A corresponding notebook is available here: https://gist.github.com/vabarbosa/dc1eeaa363e8534306a2f5e09270cfee You may access this tutorial at a later time and try it again at your own pace from here: http://ibm.biz/pixiedustlab Note : For best results, use the latest version of either Mozilla Firefox or Google Chrome. DSX DSX is an interactive, collaborative, cloud-based environment where data scientists, developers, and others interested in data science can use tools (e.g., RStudio, Jupyter Notebooks, Spark, etc.) to collaborate, share, and gather insight from their data. SIGN UP DSX is powered by IBM Bluemix, therefore your DSX login is same as your IBM Bluemix login. If you already have a Bluemix account or previously accessed DSX you may proceed to the Sign In section. Otherwise, you first need to sign up for an account. From your browser: 1. Go to the DSX site: http://datascience.ibm.com 2. Click on Sign Up 3. Enter your Email 4. Click Continue 5. Fill out the form to register for IBM Bluemix SIGN IN From your browser: 1. Go to the DSX site: http://datascience.ibm.com 2. Click on Sign In 3. Enter your IBMid or email 4. Click Continue 5. Enter your Password 6. Click Sign In JUPYTER NOTEBOOKS Jupyter Notebooks are a powerful tool for fast and flexible data analysis and can contain live code, equations, visualizations and explanatory text. CREATE A NEW NOTEBOOK You will need to create a noteboook to experiment with the data and a project to house your notebook. After signing into DSX: 1. On the upper right of the DSX site, click the + and choose Create project . 2. Enter a Name for your project 3. Select a Spark Service 4. Click Create From within the new project, you will create your notebook: 1. Click add notebooks 2. Click the Blank tab in the Create Notebook form 3. Enter a Name for the notebook 4. Select Python 2 for the Language 5. Select 2.0 for the Spark version 6. Select the Spark Service 7. Click Create Notebook You are now in your notebook and ready to start working. When you use a notebook in DSX, you can run a cell only by selecting it, then going to the toolbar and clicking on the Run Cell (▸) button. When a cell is running, an [*] is shown beside the cell. Once the cell has finished the asterisks is replaced by a number. If you don’t see the Jupyter toolbar showing the Run Cell (▸) button and other notebook controls, you are not in edit mode. Go to the dark blue toolbar above the notebook and click the edit (pencil) icon. PIXIEDUST PixieDust is an open source Python helper library that works as an add-on to Jupyter notebooks to extends the usability of notebooks. With interactive notebooks, a mundane task like creating a simple chart or saving data into a persistence repository requires mastery of complex code like this matplotlib snippet: To improve the notebook experience PixieDust simplifies much of this and provides a single display() API to visualize your data. UPDATE PIXIEDUST DSX already comes with the PixieDust library installed, but it is always a good idea to make sure you have the latest version: 1. In the first cell of the notebook enter: !pip install --upgrade pixiedust 2. Click on the Run Cell (▸) button After the cell completes, if instructed to restart the kernel, from the notebook toolbar menu: 1. Go to > Kernel > Restart 2. Click Restart in the confirmation dialog Note : The status of the kernel briefly flashes near the upper right corner, alerting when it is Not Connected , Restarting , Ready , etc. 
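If you want to double-check which PixieDust build the notebook will pick up after the upgrade and restart, one optional sanity check is to run a plain pip query in a cell. This is a generic pip command rather than part of the tutorial or the PixieDust API, so treat it as optional:

# Optional: confirm the version pip installed (run after the kernel restart)
!pip show pixiedust

The output lists the installed version; if it still shows the old one, the upgrade cell or the kernel restart probably did not complete.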
IMPORT PIXIEDUST Before, you can use the PixieDust library it must be imported into the notebook: 1. In the next cell enter: import pixiedust 2. Click on the Run Cell (▸) button Note : Whenever the kernel is restarted, the import pixiedust cell must be run before continuing. PixieDust has been updated and imported, you are now ready to play with your data! LAS VEGAS OPEN DATA You now need some data! Many cities are now making much of their data available. One such city is Las Vegas. Las Vegas Open Data is the online home of a large portion of the data the City of Las Vegas collects and makes available for citizens to see and use. It would be good to take a look at some data from the city of Las Vegas. More specifically, the Las Vegas Restaurant Inspections data. This dataset contains demerits, grades, etc from inspections of Las Vegas restaurants. LOAD THE DATA With PixieDust, you can easily load CSV data from a URL into a PySpark DataFrame in the notebook. In a new cell enter and run: inspections = pixiedust.sampleData(""https://opendata.lasvegasnevada.gov/resource/86jg-3buh.csv"") Remember to wait for the [*] indicator to turn into number, at which point the cell has completed running. Here, you are passing the URL of the Las Vegas Restaurant Inspections CSV file to PixieDust's sampleData API and store the resultant dataframe into an inspections variable. In the output, you will see logging from PixieDust as it downloads the files and creates the dataframe. VIEW THE DATA Now that you have the data into a dataframe in your notebook, it is time to take a look at it. With PixieDust's display API, you can easily view and visualize the data. In a new cell enter and run: display(inspections) The output from this cell is the PixieDust display output which includes a toolbar and a visualization area: By default, you will be presented with the Table View showing a sampling (100 rows max) of the data and the schema of the data, that is which columns are strings, integers, etc. FILTER THE DATA Looking at the restaurants data in the table, you may notice it contains entries for restaurants outside of Las Vegas. You can however, filter this to a subset of only Las Vegas restaurants. In a new cell enter and run: inspections.registerTempTable(""restaurants"") lasDF = sqlContext.sql(""SELECT * FROM restaurants WHERE city='Las Vegas'"") lasDF.count() Using a basic SQL query, you filtered the data and created a new dataframe with only restaurants in Las Vegas. The cell output is a count of entries specifically for Las Vegas. VISUALIZE THE DATA With your data ready to go, you can begin to visualize it as charts and not just a simple table. NUMBER OF RESTAURANTS BY CATEGORIES In a new cell enter and run: bycat = lasDF.groupBy(""category_name"").count() display(bycat) The result is a new table showing the number of entries by categories in the city of Las Vegas. From the PixieDust display output toolbar, you can view this data in multiple ways: 1. Click the Chart dropdown menu and choose Bar Chart 2. From the Chart Options dialog 1. Drag the category_name field and drop it into the Keys area 2. Drag the count field and drop it into the Values area 3. Set the # of Rows to Display to 1000 3. Click OK And just like that you have a bar chart showing the percentages of the entries with a given grade! RENDERING OPTIONS You can play around with the chart further to provide a better visual experience. PixieDust supports mulitple renderers, each with their own set of features and distinct look. 
The default renderer is matplotlib but you can easily switch to a different renderer. 1. Click the Renderer dropdown menu and choose bokeh 2. Toggle the Show Legend Bar Chart Option to show or hide the legend The result is a nice bar chart showing the count of the different categories of places to eat. It's probably no surprise that most are restaurants and bars. INSPECTION DEMERITS AND GRADES What if you wanted to visualize something a little more complex? What if you wanted to see the average number of inspection demerits per category clustered by the inspection grade? Give it a try! 1. In a new cell enter and run: display(lasDF) 2. Click the Chart dropdown menu and choose Bar Chart 3. From the Chart Options dialog 1. Drag the category_name field and drop it into the Keys area 2. Drag the inspection_demerits field and drop it into the Values area 3. Set the Aggregation to AVG 4. Set the # of Rows to Display to 1000 5. Click OK 4. Click the Renderer dropdown menu and choose bokeh 5. Click the Cluster By dropdown menu and choose inspection_grade 6. Click the Type dropdown menu and choose the desired bar type (e.g., stacked ) CURRENT DEMERITS VS INSPECTION DEMERITS You are not restricted to just bar charts. You can try other charts to gain additional insights and different perspective of the data. 1. Click the Options button to launch the Chart Options dialog 2. From the Chart Options dialog 1. Set the Keys to inspection_demerits 2. Set the Values to current_demerits 3. Set the # of Rows to Display to 1000 4. Click OK 3. Click the Chart dropdown menu and choose Scatter Plot 4. Select bokeh from the Renderer dropdown menu 5. Select inspection_grade from the Color dropdown menu What can be gathered from this chart? MAP THE DATA When looking at the sample data, you may have noticed it also includes the location data of the restaurants. Plotting these points on a map can also be done with PixieDust. ACCESS TOKEN For the Map renderers, a token is required for them to display properly. Currently, PixieDust has two map renderers (i.e, Google, MapBox). For this section of the tutorial, you will be using the MapBox renderer and thus a MapBox API Access Token will need to be created if you choose to continue. Open a new browser tab ( do not close the DSX browser tab ): 1. If you do not have an MapBox account, please Sign up for one: https://www.mapbox.com/studio/signup 2. If you are not already logged into MapBox, go to https://www.mapbox.com and Log in 3. Navigate to your MapBox account page: https://www.mapbox.com/studio/account 4. Click the API access tokens tab 5. Click Create a new token and give your new token a name 6. Click on Generate 7. Make note of your token 8. Return to your notebook in DSX but do not close the MapBox page just yet SHAPE THE DATA The current data includes the longitude/latitude in the location_1 field as a string like such: POINT (-114.923505 36.114434) However, the current Map renderers in PixieDust expect the longitude and latitude as separate number fields. The first thing you will need to do is parse the location_1 field into separate longitude and latitude number fields. Note : Python is indentation sensitive. Do not mix space and tab indentations. Either use strictly spaces or tabs for all indentations. The last character in the field name location_1 is the number 1 . 
In a new cell enter and run: from pyspark.sql.functions import udf from pyspark.sql.types import * def valueToLon(value): lon = float(value.split('POINT (')[1].strip(')').split(' ')[0]) return None if lon == 0 else lon if lon < 0 else (lon * -1) def valueToLat(value): lat = float(value.split('POINT (')[1].strip(')').split(' ')[1]) return None if lat == 0 else lat udfValueToLon = udf(valueToLon, DoubleType()) udfValueToLat = udf(valueToLat, DoubleType()) lonDF = lasDF.withColumn(""lon"", udfValueToLon(""location_1"")) lonlatDF = lonDF.withColumn(""lat"", udfValueToLat(""location_1"")) lonlatDF.printSchema() You should have a new dataframe ( lonlatDF ) with two new columns ( lon , lat ) which contain the longitude and latitude for the restaurant. VIEW THE MAP DATA You are ready to view the data on a map. 1. In a new cell enter and run: display(lonlatDF) 2. Click the Chart dropdown menu and choose Map 3. From the Chart Options dialog 1. Drag the lon field and the lat field and drop it into the Keys area 2. Drag the current_demerits field and drop it into the Keys area 3. Set the # of Rows to Display to 1000 4. Enter your access token from MapBox into the MapBox Access Token field. If you left the MapBox browser tab open you may return to it, copy the token and paste it here. 5. Click OK 4. Click the kind dropdown menu and choose choropleth You can move around the map and zoom into the various areas and get a quick glimpse of the restaurants current_demerits based on it's color on the map. SUMMARY Before finishing the tutorial and stepping away do not forget to sign out of DSX, MapBox, and close out of any additional tabs you opened up. In this tutorial, you covered some of the basics of visualizing data from a Jupyter Notebook with PixieDust in the IBM Data Science Experience. Visualization is just one aspect of PixieDust. PixieDust contains additional features such as a Package Manager , Spark Progress Monitor , and Scala Bridge to name a few. Likewise DSX has numerous tools to analyze your data. DSX tools make it easier to share, collaborate, and solve your toughest data challenges. Feel free to sign back into DSX at later time and continue with analyzing and visualizing this data further. Better yet, load and start experimenting with your own data. LINKS * PixieDust Hello World Lab http://ibm.biz/pixiedustlab * IBM Data Science Experience http://datascience.ibm.com * The Jupyter Notebook https://jupyter.org * I Am Not A Data Scientist https://medium.com/ibm-watson-data-lab/i-am-not-a-data-scientist-efe7ca6ceba2 * PixieDust https://ibm-watson-data-lab.github.io/pixiedust * Welcome to PixieDust https://apsportal.ibm.com/exchange/public/entry/view/5b000ed5abda694232eb5be84c3dd7c1 * Magic for Your Python Notebook https://developer.ibm.com/clouddataservices/2016/10/11/pixiedust-magic-for-python-notebook * Make Your Own Custom Visualization http://ibm.biz/pixiedustvis * FlightPredict II: The Sequel https://medium.com/ibm-watson-data-lab/flightpredict-ii-the-sequel-fb613afd6e91 * © 2017 GitHub , Inc. * Terms * Privacy * Security * Status * Help * Contact GitHub * API * Training * Shop * Blog * About You can't perform that action at this time. You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.","Create a notebook using IBM Data Science Experience using PixieDust to explore and visualize data in different ways (e.g., charts, maps, etc.) 
with one simple call.","Using Notebooks with PixieDust for Fast, Flexible, and Easier Data Analysis and Experimentation",Live,108 286,"Study Group Deep Learning Curriculum Blog Newsletter ArchiveTENSORFLOW QUICK TIPS by Malte Baumann on February 19, 2017 TENSORFLOW WAS THE NEW KID ON THE BLOCK WHEN IT WAS INTRODUCED IN 2015 AND HAS BECOME THE MOST USED DEEP LEARNING FRAMEWORK LAST YEAR. I JUMPED ON THE TRAIN A FEW MONTHS AFTER THE FIRST RELEASE AND BEGAN MY JOURNEY INTO DEEP LEARNING DURING MY MASTER'S THESIS. IT TOOK A WHILE TO GET USED TO THE COMPUTATION GRAPH AND SESSION MODEL, BUT SINCE THEN I'VE GOT MY HEAD AROUND MOST OF THE QUIRKS AND TWISTS. THIS SHORT ARTICLE IS NO INTRODUCTION TO TENSORFLOW, BUT INSTEAD OFFERS SOME QUICK TIPS, MOSTLY FOCUSED ON PERFORMANCE, THAT REVEAL COMMON PITFALLS AND MAY BOOST YOUR MODEL AND TRAINING PERFORMANCE TO NEW LEVELS. WE'LL START WITH PREPROCESSING AND YOUR INPUT PIPELINE, VISIT GRAPH CONSTRUCTION AND MOVE ON TO DEBUGGING AND PERFORMANCE OPTIMIZATIONS. PREPROCESSING AND INPUT PIPELINES KEEP PREPROCESSING CLEAN AND LEAN ARE YOU BAFFLED AT HOW LONG IT TAKES TO TRAIN YOUR RELATIVELY SIMPLE MODEL? CHECK YOUR PREPROCESSING! IF YOU'RE DOING ANY HEAVY PREPROCESSING LIKE TRANSFORMING DATA TO NEURAL NETWORK INPUTS, THOSE CAN SIGNIFICANTLY SLOW DOWN YOUR INFERENCE SPEED. IN MY CASE I WAS CREATING SO-CALLED 'DISTANCE MAPS', GRAYSCALE IMAGES USED IN ""DEEP INTERACTIVE OBJECT SELECTION"" AS ADDITIONAL INPUTS, USING A CUSTOM PYTHON FUNCTION. MY TRAINING SPEED TOPPED OUT AT AROUND 2.4 IMAGES PER SECOND EVEN WHEN I SWITCHED TO A MUCH MORE POWERFUL GTX 1080. I THEN NOTICED THE BOTTLENECK AND AFTER APPLYING MY FIX I WAS ABLE TO TRAIN AT AROUND 50 IMAGES PER SECOND. IF YOU NOTICE SUCH A BOTTLENECK THE USUAL FIRST IMPULSE IS TO OPTIMIZE THE CODE. BUT A MUCH MORE EFFECTIVE WAY TO STRIP AWAY COMPUTATION TIME FROM YOUR TRAINING PIPELINE IS TO MOVE THE PREPROCESSING INTO A ONE-TIME OPERATION THAT GENERATES TFRECORD FILES. YOUR HEAVY PREPROCESSING IS ONLY DONE ONCE TO CREATE TFRECORDS FOR ALL YOUR TRAINING DATA AND YOUR PIPELINE BOILS DOWN TO LOADING THE RECORDS. EVEN IF YOU WANT TO INTRODUCE SOME KIND OF RANDOMNESS TO AUGMENT YOUR DATA, ITS WORTH TO THINK ABOUT CREATING THE DIFFERENT VARIATIONS ONCE INSTEAD OF BLOATING YOUR PIPELINE. WATCH YOUR QUEUES A WAY TO NOTICE EXPENSIVE PREPROCESSING PIPELINES ARE THE QUEUE GRAPHS IN TENSORBOARD. THESE ARE GENERATED AUTOMATICALLY IF YOU USE THE FRAMEWORKS QUEUERUNNERS AND STORE THE SUMMARIES IN A FILE. THE GRAPHS SHOW IF YOUR MACHINE WAS ABLE TO KEEP THE QUEUES FILLED. IF YOU NOTICE NEGATIVE SPIKES IN THE GRAPHS YOUR SYSTEM IS UNABLE TO GENERATE NEW DATA IN THE TIME YOUR MACHINE WANTS TO PROCESS ONE BATCH. ONE OF THE REASONS FOR THIS WAS ALREADY DISCUSSED IN THE PREVIOUS SECTION. THE MOST COMMON REASON IN MY EXPERIENCE IS LARGE MIN_AFTER_DEQUEUE VALUES. IF YOUR QUEUES TRY TO KEEP LOTS OF RECORDS IN MEMORY, THEY CAN EASILY SATURATE YOUR CAPACITIES, WHICH LEADS TO SWAPPING AND SLOWS DOWN YOUR QUEUES SIGNIFICANTLY. OTHER REASONS COULD BE HARDWARE ISSUES LIKE TOO SLOW DISKS OR JUST LARGER DATA THAN YOUR SYSTEM CAN HANDLE. WHATEVER IT IS, FIXING IT WILL SPEED UP YOUR TRAINING PROCESS. GRAPH CONSTRUCTION AND TRAINING FINALIZE YOUR GRAPH TENSORFLOWS SEPARATE GRAPH CONSTRUCTION AND GRAPH COMPUTATION MODEL IS QUITE RARE IN DAY TO DAY PROGRAMMING AND CAN CAUSE SOME CONFUSION FOR BEGINNERS. 
THIS APPLIES TO BUGS AND ERROR MESSAGES, WHICH CAN OCCUR IN THE CODE FOR THE FIRST TIME WHEN THE GRAPH IS BUILT, AND THEN AGAIN WHEN IT'S ACTUALLY EVALUATED, WHICH IS COUNTERINTUITIVE WHEN YOU ARE USED TO CODE BEING EVALUATED JUST ONCE. ANOTHER ISSUE IS GRAPH CONSTRUCTION IN COMBINATION WITH TRAINING LOOPS. THESE LOOPS ARE USUALLY 'STANDARD' PYTHON LOOPS AND CAN THEREFORE ALTER THE GRAPH AND ADD NEW OPERATIONS TO IT. ALTERING A GRAPH WHILE CONTINUOUSLY EVALUATING IT WILL CREATE A MAJOR PERFORMANCE LOSS, BUT IS RATHER HARD TO NOTICE AT FIRST. THANKFULLY THERE IS AN EASY FIX. JUST FINALIZE YOUR GRAPH BEFORE STARTING YOUR TRAINING LOOP BY CALLING TF.GETDEFAULTGRAPH().FINALIZE() . THIS WILL LOCK THE GRAPH AND ANY ATTEMPTS TO ADD A NEW OPERATION WILL THROW AN ERROR. EXACTLY WHAT WE WANT. PROFILE YOUR GRAPH A LESS PROMINENTLY ADVERTISED FEATURE OF TENSORFLOW IS PROFILING. THERE IS A MECHANISM TO RECORD RUN TIMES AND MEMORY CONSUMPTION OF YOUR GRAPHS OPERATIONS. THIS CAN COME IN HANDY IF YOU ARE LOOKING FOR BOTTLENECKS OR NEED TO FIND OUT IF A MODEL CAN BE TRAINED ON YOUR MACHINE WITHOUT SWAPPING TO THE HARD DRIVE. TO GENERATE PROFILING DATA YOU NEED TO PERFORM A SINGLE RUN THROUGH YOUR GRAPH WITH TRACING ENABLED: # COLLECT TRACING INFORMATION DURING THE FIFTH STEP. IF GLOBAL_STEP == 5: # CREATE AN OBJECT TO HOLD THE TRACING DATA RUN_METADATA = TF.RUNMETADATA() # RUN ONE STEP AND COLLECT THE TRACING DATA _, LOSS = SESS.RUN([TRAIN_OP, LOSS_OP], OPTIONS=TF.RUNOPTIONS(TRACE_LEVEL=TF.RUNOPTIONS.FULL_TRACE), RUN_METADATA=RUN_METADATA) # ADD SUMMARY TO THE SUMMARY WRITER SUMMARY_WRITER.ADD_RUN_METADATA(RUN_METADATA, 'STEP%D', GLOBAL_STEP) AFTERWARDS A TIMELINE.JSON FILE IS SAVED TO THE CURRENT FOLDER AND THE TRACING DATA BECOME AVAILABLE IN TENSORBOARD. YOU CAN NOW EASILY SEE, HOW LONG AN OPERATION TAKES TO COMPUTE AND HOW MUCH MEMORY IT CONSUMES. JUST OPEN THE GRAPH VIEW IN TENSORBOARD, SELECT YOUR LATEST RUN ON THE LEFT AND YOU SHOULD SEE PERFORMANCE DETAILS ON THE RIGHT. ON THE ONE HAND, THIS ALLOWS YOU TO ADJUST YOUR MODEL IN ORDER TO USE YOUR MACHINE AS MUCH AS POSSIBLE, ON THE OTHER HAND, IT LETS YOU FIND BOTTLENECKS IN YOUR TRAINING PIPELINE. IF YOU PREFER A TIMELINE VIEW, YOU CAN LOAD THE TIMELINE.JSON FILE IN GOOGLE CHROMES TRACE EVENT PROFILING TOOL . ANOTHER NICE TOOL IS TFPROF , WHICH MAKES USE OF THE SAME FUNCTIONALITY FOR MEMORY AND EXECUTION TIME PROFILING, BUT OFFERS MORE CONVENIENCE FEATURES. ADDITIONAL STATISTICS REQUIRE CODE CHANGES. WATCH YOUR MEMORY PROFILING, AS EXPLAINED IN THE PREVIOUS SECTION, ALLOWS YOU TO KEEP AN EYE ON THE MEMORY USAGE OF PARTICULAR OPERATIONS, BUT WATCHING YOUR WHOLE MODELS MEMORY CONSUMPTION IS EVEN MORE IMPORTANT. ALWAYS MAKE SURE, THAT YOU DON'T EXCEED YOUR MACHINE'S MEMORY, AS SWAPPING WILL MOST CERTAINLY SLOW DOWN YOUR INPUT PIPELINE AND YOUR GPU STARTS WAITING FOR NEW DATA. A SIMPLE TOP OR, AS EXPLAINED IN ONE OF THE PREVIOUS SECTIONS, THE QUEUE GRAPHS IN TENSORBOARD SHOULD BE SUFFICIENT FOR DETECTING SUCH BEHAVIOR. DETAILED INVESTIGATION CAN THEN BE DONE USING THE AFOREMENTIONED TRACING. DEBUGGING PRINT IS YOUR FRIEND MY MAIN TOOL FOR DEBUGGING ISSUES LIKE STAGNATING LOSS OR STRANGE OUTPUTS IS TF.PRINT . DUE TO THE NATURE OF NEURAL NETWORKS, LOOKING AT THE RAW VALUES OF TENSORS INSIDE OF YOUR MODEL USUALLY DOESN'T MAKE MUCH SENSE. NOBODY CAN INTERPRET MILLIONS OF FLOATING POINT NUMBERS AND SEE WHATS WRONG. BUT ESPECIALLY PRINTING OUT SHAPES OR MEAN VALUES CAN GIVE GREAT INSIGHTS. 
IF YOU ARE TRYING TO IMPLEMENT SOME EXISTING MODEL, THIS ALLOWS YOU TO COMPARE YOUR MODEL'S VALUES TO THE ONES IN THE PAPER OR ARTICLE AND CAN HELP YOU SOLVE TRICKY ISSUES OR EXPOSE TYPOS IN PAPERS. WITH TENSORFLOW 1.0 WE HAVE BEEN GIVEN THE NEW TFDEBUGGER , WHICH LOOKS VERY PROMISING. I HAVEN'T USED IT YET, BUT WILL DEFINITELY TRY IT OUT IN THE COMING WEEKS. SET AN OPERATION EXECUTION TIMEOUT YOU HAVE IMPLEMENTED YOUR MODEL, LAUNCH YOUR SESSION AND NOTHING HAPPENS? THIS IS USUALLY CAUSED BY EMPTY QUEUES, BUT IF YOU HAVE NO IDEA, WHICH QUEUE COULD BE RESPONSIBLE FOR THE MISHAP THERE IS AN EASY FIX: JUST ENABLE THE OPERATION EXECUTION TIMEOUT WHEN CREATING YOUR SESSION AND YOUR SCRIPT WILL CRASH WHEN AN OPERATION EXCEEDS YOUR LIMIT: CONFIG = TF.CONFIGPROTO() CONFIG.OPERATION_TIMEOUT_IN_MS=5000 SESS = TF.SESSION(CONFIG=CONFIG) USING THE STACK TRACE YOU CAN THEN FIND OUT, WHICH OP CAUSES YOUR HEADACHE, FIX THE ERROR AND TRAIN ON. -------------------------------------------------------------------------------- I HOPE I COULD HELP SOME OF MY FELLOW TENSORFLOW CODERS. IF YOU FOUND AN ERROR, HAVE MORE TIPS OR JUST WANT TO GET IN TOUCH, PLEASE SEND ME AN EMAIL! SIGN UP TO RECEIVE MORE CONTENT LIKE THIS PLUS INDUSTRY NEWS, CODE AND TUTORIALS EVERY WEEK FRESH TO YOUR INBOX. No spam. One-click unsubscribe.",A weekly newsletter about the latest developments in Deep Learning.,TensorFlow Quick Tips,Live,109 289,"PIXIEDUST: MAGIC FOR YOUR PYTHON NOTEBOOK David Taieb / October 11, 2016As any data scientist knows, Python notebooks are a powerful tool for fast and flexible data analysis. But the learning curve is steep, and it’s easy to get blank page syndrome when you’re starting from scratch. Thankfully, it's easy to save and share notebooks. However, even for seasoned data scientists or developers, modifying an existing notebook can be daunting. GOT SYNTAX? Data science notebooks were first popularized in academia, and there are some formalities to work through before you can get to your analysis. For example, in a Python interactive notebook, a mundane task like creating a simple chart or saving data into a persistence repository requires mastery of complex code like this matplotlib snippet: All this for a chart?Once you do create a notebook that provides great data insights, it's hard to share with business users, who don’t want to slog through all that dry, hard-to-read code, much less tweak it and collaborate. PixieDust to the rescue. To improve the notebook experience and ease collaboration, I created an open source Python helper library that works as an add-on to Jupyter notebooks. FRIENDLIER DATA SCIENCE NOTEBOOKS When I watched data scientists and developers work with Python noteboooks, I thought it shouldn't be so difficult. PixieDust fills feature gaps that made notebooks too challenging for certain users and scenarios. Six quick benefits of PixieDust (no sound).PixieDust extends the usability of notebooks with the following features: * packageManager lets you install spark packages inside a Python notebook. This is something that you can't do today on hosted Jupyter notebooks, which prevents developers from using a large number of spark package add-ons. * visualizations. One single API called display() lets you visualize your spark object in different ways: table, charts, maps, etc…. Much easier than matplotlib (but you can still use matplotlib, if you want). This module is designed to be extensible, providing an API that lets anyone easily contribute a new visualization plugin. 
This sample visualization plugin uses d3 to show the different flight routes for each airport: * Export. Share and save your data. Download to .csv, html, json, etc. locally on your laptop or into a variety of back-end data sources, like Cloudant, dashDB, GraphDB, etc. * Scala Bridge. Use scala directly in your Python notebook. Variables are automatically transfered from Python to Scala and vice-versa. * Extensibility. Create your own visualizations using the pixiedust APIs. If you know html and css, you can write and deliver amazing graphics without forcing notebook users to type one line of code. * Apps. Allow nonprogrammers to actively use notebooks. Transform a hard-to-read notebook into a polished graphic app for business users. Check out these preliminary sample apps: * An app can feature embedded forms and responses, like flightpredict , which lets users enter flight details to see the likelihood of landing on-time. * Or present a sophisticated workflow, like our twitter demo , which delivers a real-time feed of tweets, trending hashtags, and aggregated sentiment charts with Watson Tone Analyzer. TRY IT See for yourself. You can play with pixiedust right now online via IBM's Data Science Experience. To get a look at the features you just read about, follow these steps: 1. Visit the IBM Data Science Experience and log in with your Bluemix account credentials or sign up. 2. If prompted, create an instance of the Apache Spark service. Data Science Experience may generate a Spark instance for you automatically. If not, you'll be prompted to instantiate your own. You'll need it to run your Python code. 3. Create a new notebook. On the upper left of the screen, click the hamburger menu to reveal the left menu. Then click New > Notebook . Click From URL , enter a name, and in the Notebook URL field, enter https://github.com/ibm-cds-labs/pixiedust/raw/master/notebook/Intro%20to%20PixieDust.ipynb 4. Create or select your Spark instance. If you don't already have the Spark service up, Data Science Experience prompts you to instantiate it. You'll need it to run your Python code. 5. Your new notebook opens. Run each cell in order to see a few PixieDust features If you get an error: PixieDust is preinstalled on Data Science Experience. If you get an error in cell 2, insert a cell at the top of the notebook and enter and run the following code: !pip install --user --no-deps --upgrade pixiedust Then restart the kernel and run cells above. If you get other errors, it's always a good idea to restart the kernel and try again. JOIN US PixieDust is an open source project. Join the conversation and contribute. You'll find lots of guidance in our repo's wiki with more to come. Write your own app or visualization plugin. Pull requests welcome! Visit PixieDust's GitHub repo . Later this month, I’m speaking at World of Watson [ 1 , 2 ]. Join me there to learn more about pixiedust. SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Click to share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Tagged: Apache Spark / data science / Data Science Experience / IBM Analytics for Apache Spark / IPython / Jupyter / matplotlib / Notebooks / PixieDust / Python Please enable JavaScript to view the comments powered by Disqus. 
blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Geospatial * Graph * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Object Storage * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM Privacy IBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.","An open source helper library for your Jupyter Python notebook with easier data viz & export, package manager, and Scala context from within Python!",PixieDust: Magic for Your Python Notebook,Live,110 291,"Homepage Follow Sign in / Sign up Homepage * Home * Archive * Greg Filla Blocked Unblock Follow Following Product manager. Data scientist. I like coding for data stuff Oct 19 -------------------------------------------------------------------------------- TIDY UP YOUR JUPYTER NOTEBOOKS WITH SCRIPTS Over the past few years, we have seen the transition from scripts to notebooks for data scientists. Jupyter notebooks are quickly becoming the preferred data science IDE. These notebooks are perfect for writing short code blocks to interact with data, but what happens when your project grows? At Data Science Experience, we see notebooks as the primary way data scientists want to code… but not all code should stay in the notebook. Helper functions, Classes , messy visualization code — all the necessary bits that we do not need to include in a notebook that could be used for a presentation to communicate results. Let’s start cleaning up our notebooks. Image from Pixabay.com licensed under CC BY 2.0First, I will describe how to take an existing .py script or package and use it in IBM Data Science Experience (DSX). Then, I’ll show my approach for setting up projects to facilitate clean notebooks. IMPORTING EXISTING PYTHON SCRIPTS IN DSX DSX offers a collaborative enterprise data science environment in the cloud, but many times it’s necessary to migrate existing scripts for use in DSX projects. Here are options for using a locally developed script in DSX: 1. Copy/paste code from local file into a notebook cell. At the top of this cell add %%writefile .py - this will save the code as a Python file in your GPFS working directory (GPFS is the file system that comes with the DSX Spark Service). Any notebooks using the same Spark Service instance will be able to access this file for importing. 2. Load the Python script into Object Storage. You can use Insert to Code , then take the string and write to a file in GPFS that can then be accessed the same way. I recommend option 1 because it allows you to continue to tweak code and update the script written to GPFS from this notebook. I will go into this in more detail in the section below on setting up your project. IMPORTING EXISTING PACKAGES IN DSX The methods above work great if you just have a single script that you need to import from (or execute) from inside a notebook. If you have a Python package, the following options are available for importing in DSX: Pre-req: * Python — Package up your code ( Here is an example of a simple package I wrote) * R — Package up your code (great post from Hilary Parker) . 
Check out a simple R package example here 1. Put it in a repo and install. This can be accomplished from a public or private GitHub repository (I’m sure others as well, but I have only used GitHub). Pip installing from a public repo looks like this: !pip install git+https://github.com/gfilla/dsxtools.git If you need to install from a private GitHub repository, it looks like this: !pip install git+https://:@github..com//.git --ignore-installed You get your personal_access_token from Settings > Personal Access Tokens > Generate new token. You need to give repo access to this token. For R — use this syntax for installing a package from GitHub: install.packages('devtools') library(devtools) install_github('/') #installs the package library('') #loads the package for use 2. Zip it up and load from Object Storage. This is similar to option 2 above, this time we zip up the directory with the Python package and load in an Object Storage container. Here you can use this code to get/save the zip . Once you have installed/saved the package in GPFS, you are good to start importing inside your notebook! BRING THIS ALL TOGETHER IN A DSX PROJECT At this point, you should feel confident in importing existing Python code for use in a DSX project. Let’s build on this to review one method for building out a larger project. My notebooks at the start of a new projectWhen I start a new project and have a clear vision for my goals, I will start with a “Class” notebook. This will be the notebook I will work in mostly for the early stages of my project. This notebook will be the messiest of all notebooks through most of the project lifecycle, but at the end it will be the cleanest — only including the code for the classes I will use for the project. Each cell in this notebook contains a class, we can easily write each of these cells to a Python script using the %%writefile method described above. Other notebooks in this project will import these classes to access the methods to have overall much cleaner code. An example of a cell in my “Class” notebookYou may be asking yourself why you should use this method instead of just using an IDE intended for writing larger Python programs. That is a fair question — and some projects can definitely require that approach. I prefer staying inside notebooks for class development for the same reason I use them for data analysis. I can quickly tweak my class and have any experimental code in subsequent code cells to fix any bugs (this is where it can get messy). To complete the example, I’ll show how one of these classes is used in my other notebooks. At this point, if you are new to Python and have not used classes I recommend checking out the documentation to see how they can be incorporated in your code. After executing the cell where I write the Python class to GPFS, I can simply import using syntax from import So clean..Since cnnParser is the name of my class, I instantiate an instance in the cnn variable. A very nice benefit of using classes and hanging methods on the class is that Jupyter shortcuts are available to view all methods/attributes of the class (Shift + Tab to get the view in the screenshot). If you didn’t know about this shortcut — check out this post . -------------------------------------------------------------------------------- That should be enough to get started using scripts, packages, and notebooks together in a complementary way. If you know any tips/tricks I missed please let me know ! 
Happy coding :-) * Python * Dsx * Data Science * Jupyter Notebook Blocked Unblock Follow FollowingGREG FILLA Product manager. Data scientist. I like coding for data stuff FollowIBM DATA SCIENCE EXPERIENCE Master the art of data science * * * * Never miss a story from IBM Data Science Experience , when you sign up for Medium. Learn more Never miss a story from IBM Data Science Experience Get updates Get updates",Learn how to use scripts and external packages in Jupyter notebooks to facilitate code organization for larger projects.,Tidy up your Jupyter notebooks with scripts,Live,111 295,"Skip navigation Upload Sign in SearchLoading... Close Yeah, keep it Undo CloseTHIS VIDEO IS UNAVAILABLE. WATCH QUEUE QUEUE Watch Queue Queue * Remove all * Disconnect 1. Loading... Watch Queue Queue __count__/__total__ Find out why CloseBUILDING CUSTOM MACHINE LEARNING ALGORITHMS WITH APACHE SYSTEMML Apache Spark Subscribe Subscribed Unsubscribe 15,637 15KLoading... Loading... Working... Add toWANT TO WATCH THIS AGAIN LATER? Sign in to add this video to a playlist. Sign in Share More * ReportNEED TO REPORT THE VIDEO? Sign in to report inappropriate content. Sign in * Transcript * Statistics 172 views 3LIKE THIS VIDEO? Sign in to make your opinion count. Sign in 4 0DON'T LIKE THIS VIDEO? Sign in to make your opinion count. Sign in 1Loading... Loading... TRANSCRIPT The interactive transcript could not be loaded.Loading... Loading... Rating is available when the video has been rented. This feature is not available right now. Please try again later. Published on Jun 16, 2016 * CATEGORY * Science & Technology * LICENSE * Standard YouTube License Loading... Autoplay When autoplay is enabled, a suggested video will automatically play next.UP NEXT * Breakthroughs in Machine Learning - Google I/O 2016 - Duration: 28:28. Google Developers 7,190 views 28:28 -------------------------------------------------------------------------------- * Python+Machine Learning tutorial - Introduction - Duration: 1:11:53. Microsoft Research 26 views 1:11:53 * Toward Causal Machine Learning - Duration: 57:33. Microsoft Research 1 view 57:33 * Stuff machine learning, let’s talk about climate change. - Duration: 28:38. Microsoft Research 175 views 28:38 * Machine Learning Algorithms Workshop - Duration: 1:39:55. Microsoft Research 79 views 1:39:55 * Machine learning is not the future - Google I/O 2016 - Duration: 39:00. Google Developers 17,976 views 39:00 * Livy: A REST Web Service For Apache Spark - Duration: 21:29. Apache Spark 430 views 21:29 * Machine Learning Algorithms – Part 1 - Duration: 15:53. Microsoft Azure 74 views * New 15:53 * Machine learning for algorithmic trading w/ Bert Mouler - Duration: 1:03:17. Chat With Traders 6,586 views 1:03:17 * Scalable Machine Learning Pipeline For Meta Data Discovery From eBay Listings - Duration: 29:13. Apache Spark 144 views 29:13 * Making Machine Learning Reproducible with CodaLab - Duration: 37:21. Microsoft Research 15 views 37:21 * Symposium: Deep Learning - Max Jaderberg - Duration: 20:09. Microsoft Research 149 views 20:09 * Smart Monitoring of Logs: ELK-Elastic Search, Logstash, Kibana: Anania M. and Edgar T. | Synergy - Duration: 48:24. Barcamp Yerevan 138 views 48:24 * #56 Data Science from Scratch - Duration: 51:04. Talk Python 36 views 51:04 * Elasticsearch And Apache Lucene For Apache Spark And MLlib - Duration: 33:44. Apache Spark 136 views 33:44 * Managed Dataframes And Dynamically Composable Analytics: The Bloomberg Spark Server - Duration: 28:34. 
Apache Spark 70 views 28:34 * Jose Quesada - A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and cons - Duration: 38:47. PyData 784 views 38:47 * Crate.io and CaseZero @Ticketmaster June 14, 2014 - Duration: 1:45:45. Carl Mullins 90 views 1:45:45 * GPU Computing With Apache Spark And Python - Duration: 17:35. Apache Spark 73 views 17:35 * Diving into Machine Learning - by Rob Craft, Group Product Manager at Google - Duration: 59:19. Startupfood 913 views 59:19 * Loading more suggestions... * Show more * Language: English * Country: Worldwide * Restricted Mode: Off History HelpLoading... Loading... Loading... * About * Press * Copyright * Creators * Advertise * Developers * +YouTube * Terms * Privacy * Policy & Safety * Send feedback * Try something new! * Loading... Working... Sign in to add this to Watch LaterADD TO Loading playlists...",What is Apache SystemML? Demo! How to get SystemML.,Building Custom Machine Learning Algorithms With Apache SystemML,Live,112 297,"☰ * Login * Sign Up * Learning Paths * Courses * Our Courses * Partner Courses * Badges * Our Badges * BDU Badge Program * BLOG Welcome to the BDUBlog .SUBCRIBE VIA FEED RSS - Posts RSS - Comments SUBSCRIBE VIA EMAIL Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address RECENT POSTS * This Week in Data Science (February 28, 2017) * This Week in Data Science (February 21, 2017) * Learn how to use R with Databases * This Week in Data Science (February 14, 2017) * This Week in Data Science (February 7, 2017) CONNECT ON FACEBOOK Connect on FacebookFOLLOW US ON TWITTER My TweetsBLOGROLL * RBloggers THIS WEEK IN DATA SCIENCE (FEBRUARY 28, 2017) Posted on February 28, 2017 by Janice Darling Here’s this week’s news in Data Science and Big Data. Don’t forget to subscribe if you find this useful! INTERESTING DATA SCIENCE ARTICLES AND NEWS * http://www.ibmbigdatahub.com/blog/four-perspectives-data-lakes – The relation of architecture, value, innovation and governance to data lakes. * Fueling the Gold Rush: The Greatest Public Datasets for AI – A run down of some public datasets for Artificial Intelligence. * Pandas Cheat Sheet – Python for Data Science – Cheat sheet for one of the most popular data science packages. * 17 More Must-Know Data Science Interview Questions and Answers, Part 2 – Additional must-know questions for data science interviews. * IBM, Northern Trust partner on financial security blockchain tech – IBM and Northern Trust partner to develop blockchain technology for the management of private equity funds and services. * The Origins of Big Data – A perspective summary of the field and use of the term Big Data. * How is Deep Learning Changing Data Science Paradigms? – A look at the rise of Deep Learning and its effect on Data Science Paradigms. * Melbourne IBM Research team using Watson AI to identify glaucoma – Melbourne-based IBM research team trains Watson to identify eye abnormalities. * Removing Outliers Using Standard Deviation in Python – How to remove outliers using a well known but underutilized metric. * R Packages worth a look – A short list and summaries of R statistical and graphical packages. * 25 Big Data Terms Everyone Should Know – Big Data Terms and concepts as an introduction to the field. * Moving from R to Python: The Libraries You Need to Know – Python packages and their R contemporaries. 
* Predicting the 2017 Oscar Winners – Using Machine Learning to predict the winners at the 89th annual Academy of Motion Picture Arts and Sciences Awards. * How To Hire A Data Scientist: 5 Don’ts For Data Scientist Interview Questions – How hiring managers can land a proficient data scientist. * Artificial intelligence: Understanding how machines learn – The current limits of Artificial Intelligence and Machine Learning. UPCOMING DATA SCIENCE EVENTS * IBM Webinar: Are you getting enough value from your relational database? – March 1, 2017 @ 1:00 pm – 2:00 pm * IBM Webinar: Art of the Possible…and the Reality of Execution – March 2, 2017 @ 1:00 pm – 2:00 pm FEATURED COURSES FROM BDU * Big Data 101 – What Is Big Data? Take Our Free Big Data Course to Find Out. * Predictive Modeling Fundamentals I – Take this free course and learn the different mathematical algorithms used to detect patterns hidden in data. * Using R with Databases – Learn how to unleash the power of R when working with relational databases in our newest free course. COOL DATA SCIENCE VIDEOS * Deep Learning with TensorFlow Course Summary – A summary of our free course here at BDU Deep Learning with TensorFlow. * Deep Learning with Tensorflow – Deep Belief Networks – An overview of Deep Belief Networks. * Deep Learning with Tensorflow – Autoencoder Structure –An overview of the structure and applications of an Autoencoder. * Deep Learning with Tensorflow – Autoencoders with TensorFlow –Tutorial on how to implement an Autoencoder using TensorFlow. * Deep Learning with Tensorflow – Introduction to Autoencoders – The basic concepts of Autoencoders – a type of neural network. * SHARE THIS: * Facebook * Twitter * LinkedIn * Google * Pocket * Reddit * Email * Print * * RELATED Tags: analytics , Big Data , data science -------------------------------------------------------------------------------- COMMENTS LEAVE A REPLY CANCEL REPLY * About * Contact * Blog * Events * Ambassador Program * Resources * FAQ * Legal Follow us * * * * * * * Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.",Here’s this week’s news in Data Science and Big Data. ,"This Week in Data Science (February 28, 2017)",Live,113 298,"Compose The Compose logo Articles Sign in Free 30-day trialUSE ALL THE DATABASES - PART 1 Published Mar 2, 2017 graphql developing writestuff Use all the Databases - Part 1Loren Sands-Ramshaw, author of GraphQL: The New REST , shows how to combine data from multiple sources using GraphQL in this Write Stuff two-part series. Ever wanted to use a few different databases to build your app? Different types of databases are meant for different purposes, so it often makes sense to combine them. You might be hesitant due to the complexity of maintenance and coding, but it can be easy if you combine Compose and GraphQL: instead of writing a number of complex REST endpoints, each querying multiple databases, you set up a single GraphQL endpoint that provides whatever data the client wants using your simple data fetching functions. This tutorial is meant for anyone who provides or fetches data, whether it’s a backend dev writing an API (in any language) or a frontend web or mobile dev fetching data from the server. We’ll learn about the GraphQL specification, set up a GraphQL server, and fetch data from five different data sources. 
The code is in Javascript, but you’ll still get a good idea of GraphQL without knowing the language. In this first part, we'll look at the databases that will be involved. Then we'll introduce GraphQL before moving on to the query we want to make, the schema we need to create and how to setup the server to make that all happen. In part two, we'll look at resolving queries on SQL, Elasticsearch, MongoDB, Redis and REST data sources and a look at how to get the best performance before calling things done. Part 1 * The databases * GraphQL intro * The query * The schema * Server setup Part 2 * Resolvers * SQL * Elasticsearch * MongoDB * Redis * REST * Performance * Done! THE DATABASES We at Chirper Fictional, Inc. were building a Twitter clone, and decided to use these databases: * 💾 PostgreSQL : Because like most apps, our data was relational, and our boss said that the database we wanted to use (RethinkDB) was too new to be trusted 😔. * 💾 Redis : We wanted to cache frequently-used data, like the public feed, so we could get it quickly and reduce the read load on Postgres. * 💾 Elasticsearch : A database built for searching that would function better and scale better than searching Postgres. * 💾 REST : We wanted to show our users tweets from their area, and we didn't want to prompt for GPS permissions or pay for a MaxMind IP address database, so we found a REST API for geolocating IP addresses. * 💾 MongoDB : We wanted to track some user stats, and we didn't need them to be in the main app database. We put the intern on this, and while he could have just used a second Postgres DB, he used Mongo because he heard it was Web Scale. And we didn't mind because we didn't need ACID or JOINs for our stats. Now we need a way to combine the data from all of these sources together in whichever ways our clients want it, and the best way to do this is with GraphQL. GRAPHQL INTRO Gotta be honest here... for the first few months of GraphQL's short existence (it launched in July 2015), I thought GraphQL was a query language for accessing your Facebook friend graph 😳. Turns out that’s FQL, and GraphQL is a replacement for REST! And sorry REST, but GraphQL is kinda better than you for most things 😁. Here's why: * ✅ Easier to consume : The GraphQL client's job is super simple—just write the data fields you want filled in. When you send the query string like the one on the left side of the image, you get back the JSON response on the right, with the same structure you asked for. Instead of sending multiple REST requests (sometimes multiple round trips in series), you can send a single GraphQL request. And instead of getting more or less data than you need from the REST endpoints, you get exactly the data you ask for. * ✅ Easier to produce : On the GraphQL server you write resolvers —functions that resolve a field to its value; for instance for the above, there's a user() function that responds to the user(id: 1) query and returns user #1's SQL record. One nice thing is that they work at any place in the query—looking up the current user's first name ( user.firstName is ""Maurine"" in the above example) at the top level runs the same code as looking up the author of a tweet that mentions her name ( user.mentions[0].author.firstName happens to also be ""Maurine"" ), nested in the query heirarchy ( more info on this ). Also, sometimes with REST you have endpoints talking to multiple databases. A GraphQL server is more organized, since in most cases each resolver talks to a single data source. 
Credit: Jonas Helfer * ✅ Types and introspection : Each query has a typed schema ( User , Tweet , String , Int , etc). At first it may seem like extra work, but it means that you get better error messages, query linting , and automatic server response mocking . It also has introspection—a standard method of querying the server to ask what queries it supports (and their schemas)—which is what powers Graph i QL (with an i and pronounced, “graphical”), the in-browser auto-documented GraphQL IDE described later in this article. * ✅ Version free : Because the client decides what data it wants, you can easily support many different client versions. Instead of versioning your endpoints (eg GET /api/v2/user), when you add new features, you simply add more fields. When you sunset old features, the associated fields can be deprecated but continue to function. Fear not—you don't need to rewrite all your REST servers: you can instead add a simple GraphQL server in front of them, as we'll see with the REST data source example below. Note: you can of course also change data with GraphQL (with functions called mutations ), but I won't be covering that in this post. THE QUERY Let's figure out the query that we'll need for our app's home dashboard. First, here are the things we'd like to display: * Your name and photo (SQL) * Recent tweets that mention your name (Elasticsearch) * Most recent few tweets worldwide (Redis) * Recent tweets in your city (REST to geolocate and then SQL) * For each tweet, the number of times it has been viewed (Mongo) For each tweet, we'll want to display the text of the tweet, the author's name and photo, and when it was created. For the mentions and city feeds, we also want the number of times the tweets were viewed and from what city they were made. Now to make the query, we write out the pieces of data we need in order to display the above list, choosing names for each field and putting it in a JSON-like format! 😄 const queryString = ` { user(id: 1) { firstName lastName photo mentions { text author { firstName lastName photo } city views created } } publicFeed { text author { firstName lastName photo } created } cityFeed { text author { firstName lastName photo } city views created } } ` We'll put mentions as a field of the user query instead of at the top level because we'll need to the user's name in order to query Elasticsearch, and we'll have their name from the first step of the user query (we'll see how this looks when we implement it). Parentheses are used to pass arguments—for simplicity's sake, we're passing our own user id with (id: 1) . Usually when fetching the current user’s data, instead of passing your user id as an argument, you'd put your auth token in the Authorization header, and the server would authenticate you. This is done automatically for you by frameworks like Meteor . Our query should return the below JSON data. 
The data mirrors the query format, with values filled in, sometimes with arrays of objects: { ""data"": { ""user"": { ""firstName"": ""Maurine"", ""lastName"": ""Rau"", ""photo"": ""http://placekitten.com/200/139"", ""mentions"": [ { ""text"": ""Maurine Rau Eligendi in deserunt."", ""author"": { ""firstName"": ""Maurine"", ""lastName"": ""Rau"", ""photo"": ""http://placekitten.com/200/139"" }, ""city"": ""San Francisco"", ""views"": 82, ""created"": 1481757217713 } ] }, ""publicFeed"": [ { ""text"": ""Corporis qui impedit cupiditate rerum magnam nisi velit aliquam."", ""author"": { ""firstName"": ""Tia"", ""lastName"": ""Berge"", ""photo"": ""http://placekitten.com/200/139"" }, ""city"": ""New York"", ""views"": 91, ""created"": 1481757215183 }, ... ], ""cityFeed"": [ { ""text"": ""Edmond Jones Harum ullam pariatur quos est quod."", ""author"": { ""firstName"": ""Edmond"", ""lastName"": ""Jones"", ""photo"": ""http://placekitten.com/200/139"" }, ""city"": ""Mountain View"", ""views"": 69, ""created"": 1481757216723 }, ... ] } } Now let's write the simple GraphQL server that will return that data! THE SCHEMA The first thing your server needs is a schema. This is what the server will use to provide type safety and power the introspection and improved error messages. Since we've already written out what we'd like our queries to look like, this will be easy - we just need to list out the fields and their types. First, under type Query , we list the possible queries (top-level attributes in our query string ): type Query { user(id: Int!): User # A feed of the most recent tweets worldwide publicFeed: [Tweet] # A feed of the most recent tweets in your city cityFeed: [Tweet] } code Each query is followed by the type that is returned. Besides the basic types ( String , Int , Float , Boolean ), you can make your own types, which start with a capital letter. So the first line reads, ""One possible query is the user query, which takes one required argument (the exclamation point in Int! means required) named id of type Int and which returns something of type User ."" The last line reads, ""One possible query is the cityFeed query, which has no arguments and returns an array of Tweet s."" The # comments are descriptions , which show up in the GraphiQL IDE described later. Now to define the User and Tweet types, we'll list the fields we chose in our query string : type User { firstName: String lastName: String photo: String mentions: [Tweet] } type Tweet { text: String author: User city: String views: Int created: Float } code That's our schema! The schema goes into a string: // data/schema.js const schema = ` type User { ... type Tweet { ... type Query { ... schema { query: Query } `; export default schema; data/schema.js SERVER SETUP The reference implementation of the GraphQL specification is GraphQL-JS , and it's used by graphql-server-express , a GraphQL middleware for Express , the most popular Node.js web server. 
Here's how we set it up: // server.js: import express from 'express'; import { graphqlExpress, graphiqlExpress } from 'graphql-server-express'; import { makeExecutableSchema } from 'graphql-tools'; import bodyParser from 'body-parser'; import schema from './data/schema'; import resolvers from './data/resolvers'; const graphQLServer = express(); const executableSchema = makeExecutableSchema({ typeDefs: [schema], resolvers, }); graphQLServer.use('/graphql', bodyParser.json(), graphqlExpress({ schema: executableSchema, })); graphQLServer.use('/graphiql', graphiqlExpress({ endpointURL: '/graphql', })); const GRAPHQL_PORT = 8080; graphQLServer.listen(GRAPHQL_PORT, () = server.js * import schema from './data/schema' – the GraphQL schema that we wrote in the last section * import resolvers from './data/resolvers' – an object with our resolve functions, which will do the DB lookups (we'll do this in the next article) * /graphiql – GraphiQL , the IDE for GraphQL. If you visit this URL (for us it’s http://localhost:8080/graphiql in a browser, you'll see the UI shown in the first screenshot . * While you're typing the query string in the left side of the screen, it autocompletes query fields. * When you hit the run button or cmd-return , the response from the server is shown on the right. * There's also a docs sidebar that has automatic documentation of the available queries and data fields. If you’d like to run this server on your computer, first follow the repo’s setup instructions . Now you can start the server by running server.js : nodemon ./server.js --exec babel-node And make queries in GraphiQL: http://localhost:8080/graphiql When you edit the code, the server will restart itself, and you can re-run your query in GraphiQL. Reload the page in order to get the docs and autocompletion to update. You can try out the GraphiQL of the finished Twitter clone server (powered by Compose!) here: all-the-databases.graphql.guide/graphql The only differences between the hosted server and the code running on your own computer are environment variables that contain the database connection info that you get when you set up a new Compose database. We now have a working server and schema. The server setup was short, and specifying types for the schema was intuitive, but we haven’t done anything database-specific yet. In Part 2 we’ll write the server code that fetches the right data from SQL, Elasticsearch, MongoDB, Redis, and a REST API.s. Add Compose Articles to your feed reader to get the next part! -------------------------------------------------------------------------------- attribution Hyberbole and a half This article is licensed with CC-BY-NC-SA 4.0 by Compose. Dj Walker-Morgan is Compose's resident Content Curator, and has been both a developer and writer since Apples came in II flavors and Commodores had Pets. Love this article? 
Head over to Dj Walker-Morgan ’s author page and keep reading.RELATED ARTICLES Oct 11, 2016COMPOSE: NOW AVAILABLE ON IBM BLUEMIX The power of IBM's Bluemix cloud platform is now able to seamlessly harness Compose's databases, making Compose-configured Mo… Dj Walker-Morgan Sep 28, 2016POWERING SOCIAL FEEDS AND TIMELINES WITH ELASTICSEARCH Evolving from MongoDB and Redis to Elasticsearch, Campus Discounts' founder and CTO Don Omondi talks about how and why the co… Guest Author Dec 3, 2015CHOOSING THE RIGHT SOLUTION FOR YOU - COMPOSE PB&J If you're new to some of the databases that Compose offers, you might be wondering which ones you should choose for your proj… Lisa Smith Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Customer Stories Compose Webinars Support System Status Support & Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ ScyllaDB MySQL Deployments AWS SoftLayer Google Cloud Services Enterprise Add-ons © 2017 Compose, an IBM Company","Loren Sands-Ramshaw, author of GraphQL: The New REST, shows how to combine data from multiple sources using GraphQL in this Write Stuff two-part series.",Use all the Databases,Live,114 299,"MENU Close * Home Subscribe MenuFINDING THE USER IN DATA SCIENCE 03 June 2016When the IBM Design team began researching data scientists, we had a lot to learn; but what we found was our two disciplines had a lot in common. Without connecting people to data, it’s just a bunch of stuff The Data Science practice is amazing and complex. A solo data scientist has to form a relevant hypothesis, find a corresponding data set, clean it, and repeatedly build and edit a model to prove or disprove their hypothesis. The Data Science Experience grew from our attempts to understand data science as outsiders: as designers wanting to build a tool for data scientists. We were curious how data scientists distill something interesting from inchoate data. This curiosity catapulted us into a months-long research endeavor. We synthesized research conducted in our studios all over the world and had conversations with every data scientist we could find. This included hundreds of interviews, dozens of contextual inquiries, and the production of countless research artifacts. We were astounded by the practice we uncovered, and inspired by its creativity. We came to understand data science as storytelling — an act of cutting away the meaningless, and finding humanity in a series of digits. The data science process is an experiment, the adding and subtracting of elements to find just the right mix. It’s a fluid dance of trial and error, give and take, push and pull. We realized that the tools that data scientists currently use are not designed to support this fluid process of constant refinement—the tools operate in isolation. Data scientists constantly have to navigate away from their workspaces in order to advance and edit their product. This disconnection is where we found our opportunity Finding our principles Current tools only address single facets of data science — which means data scientists must toggle back-and-forth between research and development. Data Shaper is for cleaning data, Jupyter is for modeling, and MatPlotLib is for visualizing. These tools are designed to serve a linear process, but a data scientist’s process is not linear, it’s cyclical. 
Research artifact depicting the cyclical process of data science From this model, our first design principle emerged: A holistic approach to enable data scientists. As we discussed before, much of our research involved contextual inquiries. We watched a data scientist build a pipeline — sourcing assets from the web, comparing his code to others’, and constantly jumping from tool to tool. We loved this part of the research, as it helped us understand that each facet of the process requires unique research. Notes on contextual inquiry during pipeline construction We saw him use dozens of assets of many different types. We watched him organize and name them. At any given point, he needed a tutorial, an academic paper, or a data set to move to the next step in his process, and each of these assets had to be saved and interacted with in a different environment. The process he used to manage his resources helped us establish a tentative system for artifact classification. It was also enlightening to watch him browse for resources. Whether he was scrolling through lists in databases or scanning forums for code, he had criteria for assessing the value of these artifacts. We watched him pull code from several different projects and seek advice on API implementation from a forum. It became obvious that a data science project can’t just stand on its own. It needs support and validation from the community. An artifact, whether code snippet, API, or academic paper, is only as strong as the people who use it. The more an artifact is employed, the more people there are to discuss it. The public use of an artifact sharpens its quality. The value of an asset is determined by the discussion around it — its documentation, its versioning, and its critics. The evolution of data science is fueled by the collaborative processes of building off of each other’s work. This understanding led us to our second, and arguably most inspiring principle, Community first. The community is the strongest tool a data scientist can access. So why hasn’t it been factored in any of their current interfaces? Turning principles into practice We wanted to create an interface that was open and dynamic, just like the modeling process we observed. We determined that our concept must allow the data scientists to converse, learn, and research in the context of their software. We knew our design had to operate as a toolbox that was more dynamic than just a collection of software applications. In addition to providing data scientists with the full scope of software products that they need to complete their process, we need to address their need to validate and advance their work through research. This helped us design one of our first concepts: the maker palette. This feature developed from the idea that the community is a tool — just as important as a notebook or data set. The design treatment is just the same as any other resource--it appears in a panel that can be opened and closed at will. The benefit is that it’s not specific to a file format or tool, so it can be accessed in any part of the interface. A user test with the maker palette In the community palette, a data scientist can find data sets, access papers, view tutorials, and compare their code to others. When they’re uninspired or stuck, the community acts as both peer, tool, and teacher. Mixed content The practice of data science surrounds the building of a pipeline, which is a sequence of algorithms that process and learn from data. 
As we watched data scientists build their pipelines in notebooks, we likened the process to building a wall around a garden, brick by brick. Each brick must be tested to see if it fits the within the bricks that preceded it. These bricks, collected piecemeal throughout the process, slowly enclose the desired pieces of data. The implementation of these bricks requires supplemental materials, like documentation and user testimonials. While these materials will not be included in the pipeline, they need to be viewed in the context of the code. Although they manifest as different file types, these materials are building blocks also, and are just as necessary to the advancement of a project as an actual line of code. The brick building metaphor inspired the form of our design. We translated the modularity of pipeline construction into a card design paradigm for the interface. Having a uniform treatment for a variety of content types allowed us to streamline the search for resources. A key component of our maker palette was the ability to display mixed content in a singular environment. The data scientist can search for any type of asset inside of their workspace, and review and reference it in a singular, cohesive environment. The design of our cards was shaped by repeated user testing. The card-in-panel format gives the data scientist the ability to quickly test a variety of assets in their work. They can make off-the-cuff adjustments without having to make time commitments to deep research or additional tools. They can repeatedly complete the cycles of their work--ask, build, test, refine--in one unified experience. In data scientists, we see ourselves In IBM Design, we often discuss “the loop,” or the practice of continuous refinement of an idea through research and testing. Like the scientific method, we design a hypothesis, develop prototypes, test them, make observations, and adjust. As software designers, we’re constantly trying to find the storyline in “stuff.” Much like data scientists, we sift through the extraneous to find the human elements in products and processes. At the beginning, data science seemed complex and distant, and now, after all our research and a little self-reflection, it seems strangely familiar. Data Science Experience Creation Zoe Padgett and Eytan Davidovits's PictureZOE PADGETT AND EYTAN DAVIDOVITS Read more posts by this author. SHARE THIS POST Twitter Facebook Google+ IBM Data Science Experience Blog © 2016 Proudly published with Ghost","When the IBM Design team began researching data scientists, we had a lot to learn; but what we found was our two disciplines had a lot in common.",Finding the user in data science,Live,115 300,"* Be a better programmer CATEGORIES Toggle navigation * Algorithms * Competitive Programming * Internet of Things * Python * Machine Learning ×WANT A CAREER IN DATA SCIENCE / ANALYTICS ? Drop your email to get latest tutorials, career paths, projects, jobs in machine learning & data science. GET MORE STUFF Subscribe now to get the latest updates from the developer community in your inbox! PRACTICAL TUTORIAL ON RANDOM FOREST AND PARAMETER TUNING IN R Open Modal Open Modal Machine Learning R December 14, 2016 Share 120INTRODUCTION Treat ""forests"" well. Not for the sake of nature, but for solving problems too! Random Forest is one of the most versatile machine learning algorithms available today. With its built-in ensembling capacity, the task of building a decent generalized model (on any dataset) gets much easier. 
However, I've seen people using random forest as a black box model; i.e., they don't understand what's happening beneath the code. They just code. In fact, the easiest part of machine learning is coding . If you are new to machine learning, the random forest algorithm should be on your tips. Its ability to solve—both regression and classification problems along with robustness to correlated features and variable importance plot gives us enough head start to solve various problems. Most often, I've seen people getting confused in bagging and random forest. Do you know the difference? In this article, I'll explain the complete concept of random forest and bagging. For ease of understanding, I've kept the explanation simple yet enriching. I've used MLR, data.table packages to implement bagging, and random forest with parameter tuning in R. Also, you'll learn the techniques I've used to improve model accuracy from ~82% to 86%. TABLE OF CONTENTS 1. What is the Random Forest algorithm? 2. How does it work? (Decision Tree, Random Forest) 3. What is the difference between Bagging and Random Forest? 4. Advantages and Disadvantages of Random Forest 5. Solving a Problem * Parameter Tuning in Random Forest WHAT IS THE RANDOM FOREST ALGORITHM? Random forest is a tree-based algorithm which involves building several trees (decision trees), then combining their output to improve generalization ability of the model. The method of combining trees is known as an ensemble method. Ensembling is nothing but a combination of weak learners (individual trees) to produce a strong learner. Say, you want to watch a movie. But you are uncertain of its reviews. You ask 10 people who have watched the movie. 8 of them said "" the movie is fantastic."" Since the majority is in favor, you decide to watch the movie. This is how we use ensemble techniques in our daily life too. Random Forest can be used to solve regression and classification problems. In regression problems, the dependent variable is continuous. In classification problems, the dependent variable is categorical. Trivia: The random Forest algorithm was created by Leo Brieman and Adele Cutler in 2001. HOW DOES IT WORK? (DECISION TREE, RANDOM FOREST) To understand the working of a random forest, it's crucial that you understand a tree . A tree works in the following way: 1. Given a data frame (n x p), a tree stratifies or partitions the data based on rules (if-else). Yes, a tree creates rules. These rules divide the data set into distinct and non-overlapping regions. These rules are determined by a variable's contribution to the homogenity or pureness of the resultant child nodes (X2,X3). 2. In the image above, the variable X1 resulted in highest homogeneity in child nodes, hence it became the root node. A variable at root node is also seen as the most important variable in the data set. 3, But how is this homogeneity or pureness determined? In other words, how does the tree decide at which variable to split? * In regression trees (where the output is predicted using the mean of observations in the terminal nodes), the splitting decision is based on minimizing RSS. The variable which leads to the greatest possible reduction in RSS is chosen as the root node. The tree splitting takes a top-down greedy approach, also known as recursive binary splitting . We call it ""greedy"" because the algorithm cares to make the best split at the current step rather than saving a split for better results on future nodes. 
* In classification trees (where the output is predicted using mode of observations in the terminal nodes), the splitting decision is based on the following methods: * Gini Index - It's a measure of node purity. If the Gini index takes on a smaller value, it suggests that the node is pure. For a split to take place, the Gini index for a child node should be less than that for the parent node. * Entropy - Entropy is a measure of node impurity. For a binary class (a,b), the formula to calculate it is shown below. Entropy is maximum at p = 0.5. For p(X=a)=0.5 or p(X=b)=0.5 means, a new observation has a 50%-50% chance of getting classified in either classes. The entropy is minimum when the probability is 0 or 1. Entropy = - p(a)*log(p(a)) - p(b)*log(p(b)) In a nutshell, every tree attempts to create rules in such a way that the resultant terminal nodes could be as pure as possible. Higher the purity, lesser the uncertainity to make the decision. But a decision tree suffers from high variance. ""High Variance"" means getting high prediction error on unseen data. We can overcome the variance problem by using more data for training. But since the data set available is limited to us, we can use resampling techniques like bagging and random forest to generate more data. Building many decision trees results in a forest . A random forest works the following way: 1. First, it uses the Bagging (Bootstrap Aggregating) algorithm to create random samples. Given a data set D1 (n rows and p columns), it creates a new dataset (D2) by sampling n cases at random with replacement from the original data. About 1/3 of the rows from D1 are left out, known as Out of Bag(OOB) samples. 2. Then, the model trains on D2. OOB sample is used to determine unbiased estimate of the error. 3. Out of p columns, P << p columns are selected at each node in the data set. The P columns are selected at random. Usually, the default choice of P is p/3 for regression tree and P is sqrt(p) for classification tree. 4. Unlike a tree, no pruning takes place in random forest; i.e, each tree is grown fully. In decision trees, pruning is a method to avoid overfitting. Pruning means selecting a subtree that leads to the lowest test errror rate. We can use cross validation to determine the test error rate of a subtree. 5. Several trees are grown and the final prediction is obtained by averaging or voting. Each tree is grown on a different sample of original data. Since random forest has the feature to calculate OOB error internally, cross validation doesn't make much sense in random forest. WHAT IS THE DIFFERENCE BETWEEN BAGGING AND RANDOM FOREST? Many a time, we fail to ascertain that bagging is not same as random forest. To understand the difference, let's see how bagging works: 1. It creates randomized samples of the data set (just like random forest) and grows trees on a different sample of the original data. The remaining 1/3 of the sample is used to estimate unbiased OOB error. 2. It considers all the features at a node (for splitting). 3. Once the trees are fully grown, it uses averaging or voting to combine the resultant predictions. Aren't you thinking, ""If both the algorithms do same thing, what is the need for random forest? Couldn't we have accomplished our task with bagging?"" NO! The need for random forest surfaced after discovering that the bagging algorithm results in correlated trees when faced with a data set having strong predictors. 
Unfortunately, averaging several highly correlated trees doesn't lead to a large reduction in variance. But how do correlated trees emerge? Good question! Let's say a data set has a very strong predictor , along with other moderately strong predictors. In bagging, a tree grown every time would consider the very strong predictor at its root node, thereby resulting in trees similar to each other. The main difference between random forest and bagging is that random forest considers only a subset of predictors at a split. This results in trees with different predictors at top split, thereby resulting in decorrelated trees and more reliable average output. That's why we say random forest is robust to correlated predictors. ADVANTAGES AND DISADVANTAGES OF RANDOM FOREST Advantages are as follows: 1. It is robust to correlated predictors. 2. It is used to solve both regression and classification problems. 3. It can be also used to solve unsupervised ML problems. 4. It can handle thousands of input variables without variable selection. 5. It can be used as a feature selection tool using its variable importance plot. 6. It takes care of missing data internally in an effective manner. Disadvantages are as follows: 1. The Random Forest model is difficult to interpret. 2. It tends to return erratic predictions for observations out of range of training data. For example, the training data contains two variable x and y. The range of x variable is 30 to 70. If the test data has x = 200, random forest would give an unreliable prediction. 3. It can take longer than expected time to computer a large number of trees. SOLVING A PROBLEM (PARAMETER TUNING) Let's take a data set to compare the performance of bagging and random forest algorithms. Along the way, I'll also explain important parameters used for parameter tuning. In R, we'll use MLR and data.table package to do this analysis. I've taken the Adult dataset from the UCI machine learning repository. You can download the data from here . This data set presents a binary classification problem to solve. Given a set of features, we need to predict if a person's salary is <=50K or >=50k. Since the given data isn't well structured, we'll need to make some modification while reading the data set. #set working directory > path <- ""~/December 2016/RF_Tutorial"" > setwd(path) #load libraries > library(data.table) > library(mlr) > library(h2o) #set variable names setcol <- c(""age"", ""workclass"", ""fnlwgt"", ""education"", ""education-num"", ""marital-status"", ""occupation"", ""relationship"", ""race"", ""sex"", ""capital-gain"", ""capital-loss"", ""hours-per-week"", ""native-country"", ""target"") #load data > train <- read.table(""adultdata.txt"",header = F,sep = "","",col.names = setcol,na.strings = c("" ?""),stringsAsFactors = F) > test <- read.table(""adulttest.txt"",header = F,sep = "","",col.names = setcol,skip = 1, na.strings = c("" ?""),stringsAsFactors = F) After we've loaded the data set, first we'll set the data class to data.table. data.table is the most powerful R package made for faster data manipulation. > setDT(train) > setDT(test) Now, we'll quickly look at given variables, data dimensions, etc. > dim(train) > dim(test) > str(train) > str(test) As seen from the output above, we can derive the following insights: 1. The train data set has 32,561 rows and 15 columns. 2. The test data has 16,281 rows and 15 columns. 3. Variable target is the dependent variable. 4. The target variable in train and test data is different. We'll need to match them. 
5. All character variables have a leading whitespace which can be removed. We can check missing values using: #check missing values > table(is.na(train)) FALSE TRUE 484153 4262 > sapply(train, function(x) sum(is.na(x))/length(x))*100 > table(is.na(test)) FALSE TRUE 242012 2203 > sapply(test, function(x) sum(is.na(x))/length(x))*100 As seen above, both train and test datasets have missing values. The sapply function is quite handy when it comes to performing column computations. Above, it returns the percentage of missing values per column. Now, we'll preprocess the data to prepare it for training. In R, random forest internally takes care of missing values using mean/ mode imputation. Practically speaking, sometimes it takes longer than expected for the model to run. Therefore, in order to avoid waiting time, let's impute the missing values using median / mode imputation method; i.e., missing values in the integer variable will be imputed with median and factor variables will be imputed with mode (most frequent value). We'll use the impute function from MLR package, which is enabled with several unique methods for missing value imputation: > imp1 <- impute(data = train,target = ""target"",classes = list(integer=imputeMedian(), factor=imputeMode())) > imp2 <- impute(data = test,target = ""target"",classes = list(integer=imputeMedian(), factor=imputeMode())) > train <- imp1$data > test <- imp2$data Being a binary classification problem, you are always advised to check if the data is imbalanced or not. We can do it in the following way: > setDT(train)[,.N/nrow(train),target] target V1 1: <=50K 0.7591904 2: >50K 0.2408096 > setDT(test)[,.N/nrow(test),target] target V1 1: <=50K. 0.7637737 2: >50K. 0.2362263 If you observe carefully, the value of the target variable is different in test and train. For now, we can consider it a typo error and correct all the test values. Also, we see that 75% of people in train data have income <=50K. Imbalanced classification problems are known to be more skewed with a binary class distribution of 90% to 10%. Now, let's proceed and clean the target column in test data. > test[,target := substr(target,start = 1,stop = nchar(target)-1)] We've used the substr function to return the subtring from a specified start and end position. Next, we'll remove the leading whitespaces from all character variables. We'll use str_trim function from stringr package. > library(stringr) > char_col <- colnames(train)[sapply(train,is.character)] > for(i in char_col) set(train,j=i,value = str_trim(train[[i]],side = ""left"")) Using sapply function, we've extracted the column names which have character class. Then, using a simple for - set loop we traversed all those columns and applied the str_trim function. Before we start model training, we should convert all character variables to factor. MLR package treats character class as unknown. > fact_col <- colnames(train)[sapply(train,is.character)] >for(i in fact_col) set(train,j=i,value = factor(train[[i]])) >for(i in fact_col) set(test,j=i,value = factor(test[[i]])) Let's start with modeling now. MLR package has its own function to convert data into a task, build learners, and optimize learning algorithms. I suggest you stick to the modeling structure described below for using MLR on any data set. 
#create a task > traintask <- makeClassifTask(data = train,target = ""target"") > testtask <- makeClassifTask(data = test,target = ""target"") #create learner > bag <- makeLearner(""classif.rpart"",predict.type = ""response"") > bag.lrn <- makeBaggingWrapper(learner = bag,bw.iters = 100,bw.replace = TRUE) I've set up the bagging algorithm which will grow 100 trees on randomized samples of data with replacement. To check the performance, let's set up a validation strategy too: #set 5 fold cross validation > rdesc <- makeResampleDesc(""CV"",iters=5L) For faster computation, we'll use parallel computation backend. Make sure your machine / laptop doesn't have many programs running at backend. #set parallel backend (Windows) > library(parallelMap) > library(parallel) > parallelStartSocket(cpus = detectCores()) For linux users, the function parallelStartMulticore(cpus = detectCores()) will activate parallel backend. I've used all the cores here. r <- resample(learner = bag.lrn ,task = traintask ,resampling = rdesc ,measures = list(tpr,fpr,fnr,fpr,acc) ,show.info = T) #[Resample] Result: # tpr.test.mean=0.95, # fnr.test.mean=0.0505, # fpr.test.mean=0.487, # acc.test.mean=0.845 Being a binary classification problem, I've used the components of confusion matrix to check the model's accuracy. With 100 trees, bagging has returned an accuracy of 84.5%, which is way better than the baseline accuracy of 75%. Let's now check the performance of random forest. #make randomForest learner > rf.lrn <- makeLearner(""classif.randomForest"") > rf.lrn$par.vals <- list(ntree = 100L, importance=TRUE) ) > r <- resample(learner = rf.lrn ,task = traintask ,resampling = rdesc ,measures = list(tpr,fpr,fnr,fpr,acc) ,show.info = T) # Result: # tpr.test.mean=0.996, # fpr.test.mean=0.72, # fnr.test.mean=0.0034, # acc.test.mean=0.825 On this data set, random forest performs worse than bagging. Both used 100 trees and random forest returns an overall accuracy of 82.5 %. An apparent reason being that this algorithm is messing up classifying the negative class. As you can see, it classified 99.6% of the positive classes correctly, which is way better than the bagging algorithm. But it incorrectly classified 72% of the negative classes. Internally, random forest uses a cutoff of 0.5; i.e., if a particular unseen observation has a probability higher than 0.5, it will be classified as <=50K. In random forest, we have the option to customize the internal cutoff. As the false positive rate is very high now, we'll increase the cutoff for positive classes (<=50K) and accordingly reduce it for negative classes (>=50K). Then, train the model again. #set cutoff > rf.lrn$par.vals <- list(ntree = 100L, importance=TRUE, cutoff = c(0.75,0.25)) > r <- resample(learner = rf.lrn ,task = traintask ,resampling = rdesc ,measures = list(tpr,fpr,fnr,fpr,acc) ,show.info = T) #Result: tpr.test.mean=0.934, # fpr.test.mean=0.43, # fnr.test.mean=0.0662, # acc.test.mean=0.846 As you can see, we've improved the accuracy of the random forest model by 2%, which is slightly higher than that for the bagging model. Now, let's try and make this model better. Parameter Tuning: Mainly, there are three parameters in the random forest algorithm which you should look at (for tuning): * ntree - As the name suggests, the number of trees to grow. Larger the tree, it will be more computationally expensive to build models. * mtry - It refers to how many variables we should select at a node split. 
Also as mentioned above, the default value is p/3 for regression and sqrt(p) for classification. We should always try to avoid using smaller values of mtry to avoid overfitting. * nodesize - It refers to how many observations we want in the terminal nodes. This parameter is directly related to tree depth. Higher the number, lower the tree depth. With lower tree depth, the tree might even fail to recognize useful signals from the data. Let get to the playground and try to improve our model's accuracy further. In MLR package, you can list all tuning parameters a model can support using: > getParamSet(rf.lrn) #set parameter space params <- makeParamSet( makeIntegerParam(""mtry"",lower = 2,upper = 10), makeIntegerParam(""nodesize"",lower = 10,upper = 50) ) #set validation strategy rdesc <- makeResampleDesc(""CV"",iters=5L) #set optimization technique ctrl <- makeTuneControlRandom(maxit = 5L) #start tuning > tune <- tuneParams(learner = rf.lrn ,task = traintask ,resampling = rdesc ,measures = list(acc) ,par.set = params ,control = ctrl ,show.info = T) [Tune] Result: mtry=2; nodesize=23 : acc.test.mean=0.858 After tuning, we have achieved an overall accuracy of 85.8%, which is better than our previous random forest model. This way you can tweak your model and improve its accuracy. I'll leave you here. The complete code for this analysis can be downloaded from Github . SUMMARY Don't stop here! There is still a huge scope for improvement in this model. Cross validation accuracy is generally more optimistic than true test accuracy. To make a prediction on the test set, minimal data preprocessing on categorical variables is required. Do it and share your results in the comments below. My motive to create this tutorial is to get you started using the random forest model and some techniques to improve model accuracy. For better understanding, I suggest you read more on confusion matrix. In this article, I've explained the working of decision trees, random forest, and bagging. Did I miss out anything? Do share your knowledge and let me know your experience while solving classification problems in comments below. Share 120ABOUT THE AUTHOR Manish Saraswat * * Making an effort to help people understand Machine Learning. I believe your educational background doesn't stop you to pursue ML & Data Science. Earned Masters in F/M, a self taught data science professional. Previously worked at Analytics Vidhya. Now solving ML & Growth challenges at HackerEarth!AUTHOR POST Machine Learning R DEEP LEARNING & PARAMETER TUNING WITH MXNET, H2O PACKAGE IN R Jan 30, 2017 Machine Learning R PRACTICAL GUIDE TO CLUSTERING ALGORITHMS & EVALUATION IN R Jan 19, 2017 Machine Learning R HOW CAN R USERS LEARN PYTHON FOR DATA SCIENCE ? Jan 12, 2017 Machine Learning R PRACTICAL GUIDE TO LOGISTIC REGRESSION ANALYSIS IN R Jan 5, 2017 Machine Learning R EXCLUSIVE SQL TUTORIAL ON DATA ANALYSIS IN R Dec 28, 2016 Machine Learning R BEGINNERS TUTORIAL ON XGBOOST AND PARAMETER TUNING IN R Dec 20, 2016 Machine Learning R DEEP LEARNING & PARAMETER TUNING WITH MXNET, H2O PACKAGE IN R Jan 30, 2017 Machine Learning R PRACTICAL GUIDE TO CLUSTERING ALGORITHMS & EVALUATION IN R Jan 19, 2017 Machine Learning R HOW CAN R USERS LEARN PYTHON FOR DATA SCIENCE ? 
Jan 12, 2017 Machine Learning R PRACTICAL GUIDE TO LOGISTIC REGRESSION ANALYSIS IN R Jan 5, 2017 Machine Learning R EXCLUSIVE SQL TUTORIAL ON DATA ANALYSIS IN R Dec 28, 2016 Machine Learning R BEGINNERS TUTORIAL ON XGBOOST AND PARAMETER TUNING IN R Dec 20, 2016 Please enable JavaScript to view the comments powered by Disqus. x LIVE WEBINAR Machine Learning in a Live Production Environment Register NowABOUT US * Blog * Engineering Blog * Updates & Releases * Team * Careers * In the Press TOP CATEGORIES * Hiring * Placements * Hackathons * Community * Competitive Programming * Culture RESOURCES * Webinars * Podcasts * CodeTable * Hackathon Handbook * Complete Reference to Competitive Programming * How to get started with Open Source FOR COMPANIES * Recruit * Assessment * Sourcing * Host Hackathons * Interview © 2017 HackerEarth Share 120","In this tutorial, the complete concept of random forest and bagging is explained.",Practical Tutorial on Random Forest and Parameter Tuning in R,Live,116 301,"* Home * Community * Projects * Blog * About * Resources * Code * Contributions * University * IBM Design * Apache SystemML * Apache Spark™ SPARK.TC ☰ * Community * Projects * Blog * About * Resources * Code * Contributions * University * IBM Design * Apache SystemML * Apache Spark™ APACHE SPARK™ 2.0: MIGRATING APPLICATIONS Many excellent fixes, enhancements and new features are available in Apache Spark TM 2.0 as highlighted in What's New in Apache Spark TM 2.0 . High-level descriptions for migrating applications to Apache Spark TM 2.0 can be found at Apache Spark TM SQL Programming Guide and Apache Spark TM MLlib Guide . This post provides a brief summary of sample code changes to migrate a Java application from Apache Spark TM 1.6 to Apache Spark TM 2.0. The migration effort is dependent upon the Apache Spark TM APIs a given application uses. Note a few breaking API changes introduced in 2.0 release can result in compilation errors for an application compatible with previous releases. The most common compilation errors when initially updating a Java application for 2.0 release are as follows. * DataFrame cannot be resolved to a type. The import org.apache.spark.sql.DataFrame cannot be resolved. * The methods fit, transform, train must override or implement a supertype method * The return type is incompatible with PairFlatMapFunction .call(Iterator<... ). Resolving each one is straightforward by applying a group of code changes as follows. Replace DataFrame variable declarations and references with Dataset< Row .For Java applications, the type org.apache.spark.sql.DataFrame type no longer exists because in Scala it has been redefined as a type alias for Dataset[Row]. So in general, for each Java class that uses DataFrame, apply the following pattern. Replace: import org.apache.spark.sql.DataFrame; With: import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; Change: DataFrame df; To: Dataset df) transform(Dataset df) train(Dataset df) This change is related to SPARK-14500 Accept Dataset[] instead of DataFrame in MLlib APIs . In Scala MLlib APIs, DataFrame was replaced by Dataset[_]. For Java, this requires using Dataset< ? instead. Replace Iterable< with Iterator< for classes implementing PairFlatMapFunction.If a Java class implements PairFlatMapFunction (or other variations of FlatMapFunction), compiling against 2.0 API reports an error like the following: The return type is incompatible with PairFlatMapFunction>.call(Iterator<...>). 
To resolve, change the declared return type from Iterable to Iterator in the call() method override and import java.util.Iterator. In addition, modify the return value to return an iterator() of the collection instead of the collection itself. Below is a partial code fragment to illustrate what to modify for a class that implements FlatMapFunction and corresponding call() method. Change: public class CustomFlatMapFunction implements FlatMapFunction>, String> { @Override public Iterable call(Tuple2> arg0) throws Exception { ArrayList = new ArrayList>, String> { @Override public Iterator call(Tuple2> arg0) throws Exception { ArrayList = new ArrayList> <> which looks something like this: ./spark-submit.sh --vcap ./vcaps.json --deploy-mode cluster --class com.ibm.cds.spark.samples.HelloSpark --master https://169.54.219.20:8443 ./helloSpark-assembly-2.1.jar When the script finishes, it displays the location of the log file where you can find more information about your job. Done downloading from workdir/driver-20160408121238-0001-98831756-2640-427d-a7a2-b30ebd91b8f2/stderr to stderr_1460135564N Log file can be found at spark-submit_1460135564N.log 3. To access the driver machine logs for your application, look at the end in the spark-submit_XXXX.log and locate the curl command that lets you download the stdout:curl -D ""stdout_1460140266N.header"" -v -X GET --insecure -u sd73-de3b55cc941e55-4137fa4057f6:c8d92cd6-d13d-435e-b7a9-a4a7b96c0b79 -H ""X-Spark-service-instance-id: 3fd28f1b-cedc-4b50-bd73-de3b55cc941e"" https://169.54.219.20/tenant/data/workdir/driver-20160408133059-0003-ec49b480-f62a-4b76-b5b7-eb03b095dd0d/stdout Run it, and this command should return the following results: Hello Spark Demo. Compute the mean and variance of a collection Results: Mean: 250000.0 Variance: 2.083325E10 Note: Easy access to log messages from different Spark executors, called Spark History , is coming soon. When it’s available, I’ll write a follow-up describing it indetail. Stay tuned. In the meantime, this tutorial covers a quick way to check status or cancel a job .SPARK-SUBMIT JOB USING PYTHONTo submit a job using Python, follow the same pattern as in Scala except thatyou’re using a py script instead of a jar: 1. Create a py script called helloSpark.py (or download it from here ) as follows:import sys from pyspark import SparkContext def computeStatsForCollection(sc,countPerPartitions=100000,partitions=5): totalNumber = min( countPerPartitions * partitions, sys.maxsize) rdd = sc.parallelize( range(totalNumber),partitions) return (rdd.mean(), rdd.variance()) if __name__ == ""__main__"": sc = SparkContext(appName=""Hello Spark"") print(""Hello Spark Demo. Compute the mean and variance of a collection"") stats = computeStatsForCollection(sc); print("" Results: "") print(""Mean: "" + str(stats[0])); print(""Variance: "" + str(stats[1])); sc.stop() 2. Invoke the spark-submit.sh script as follows:./spark-submit.sh --vcap ./vcaps.json --deploy-mode cluster --master https://169.54.219.20:8443 <> GET STATUS AND CANCEL A LONG-RUNNING JOBWhen you’re dealing with long-running jobs, you may want to query the status.Also, sometimes you can’t wait for a long Spark job to complete and want to killthe job before it finishes. You can handle both of these tasks using thespark-submit.sh script. 
(The following steps work whether you used Scala orPython to launch the job.)Open the log file and locate the Submission ID, which looks something like this:Submission ID : driver-20160408121238-0001-98831756-2640-427d-a7a2-b30ebd91b8f2and use the value in the following commands: * To get a job status:./spark-submit.sh --vcap ./vcaps.json --master https://169.54.219.20:8443 --status driver-20160408121238-0001-98831756-2640-427d-a7a2-b30ebd91b8f2 * To kill the job:./spark-submit.sh --vcap ./vcaps.json --master https://169.54.219.20:8443 --kill driver-20160408121238-0001-98831756-2640-427d-a7a2-b30ebd91b8f2 CONCLUSIONIn this tutorial, you learned how to use spark-submit.sh to run a Spark batchjob using Scala and Python. We also looked at how to monitor the job, check thestatus, and kill the job. You can find more information on spark-submitfunctionality here .Stay tuned for an upcoming post on how to use Spark History which will provide anice UI with an aggregated view of all log messages produced by each executor inthe cluster.SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Geospatial * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM PrivacyIBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.",How to run Spark batch jobs programmatically. See examples in both Scala and Python that launch a Hello World Spark job via spark-submit.,Launch a Spark job using spark-submit,Live,119 306,"Homepage Follow Sign in / Sign up John Thomas Blocked Unblock Follow Following IBM Distinguished Engineer. #Cloud, #Analytics, #Cognitive, #zSystems, #ITEconomics. Chess, Food, Travel (60+ countries). Tweets are personal opinions. Jun 12 -------------------------------------------------------------------------------- MACHINE LEARNING & APACHE SPARK: A DYNAMIC DUO The Machine Learning revolution is underway and is changing industries and delivering outcomes that were unimaginable a few years ago. In this video of John J. Thomas’s keynote at ApacheCon on May 17, 2017, learn how Apache Spark and other related projects are being used by innovative companies to remake products and services and enabling data-driven decision making. For more information, visit the Data Science Experience . Video courtesy of the The Linux Foundation ( https://www.linuxfoundation.org/ ) via YouTube. * Apache Spark * Machine Learning Blocked Unblock Follow FollowingJOHN THOMAS IBM Distinguished Engineer. #Cloud, #Analytics, #Cognitive, #zSystems, #ITEconomics. Chess, Food, Travel (60+ countries). Tweets are personal opinions. FollowINSIDE MACHINE LEARNING Deep-dive articles about machine learning and data. Curated by IBM Analytics. 
* Share * * * * Never miss a story from Inside Machine learning , when you sign up for Medium. Learn more Never miss a story from Inside Machine learning Get updates Get updates",The Machine Learning revolution is underway and is changing industries and delivering outcomes that were unimaginable a few years ago. In this video of John J. Thomas’s keynote at ApacheCon on May 17…,A Dynamic Duo – Inside Machine learning – Medium,Live,120 307,"PICKING SQL OR NOSQL? – A COMPOSE VIEW Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published Jun 20, 2016Here's a question we hear a lot - Should I use SQL databases or NoSQL databases? It's a question that gets asked because often underlying it is another question - What's broken in SQL databases that NoSQL databases fixes? The answer to that one is much easier. Nothing is broken because they are different approaches to creating databases in the same way that assembler and higher level languages are to creating applications. Think of a typical high level language. It abstracts away all the ideas of machine code – of scheduling, memory management, interrupts, processor stacks and and buffers – into a different intellectual framework that is the language. You write a program in the language and a compiler or interpreter steps in and turns your code into digestible chunks of machine code (or intermediate code) to be run on some actual hardware. You don't care about that though, all you care is that your code can go into any machine and right things happen. You can think of this as akin to SQL; you write your high level query which is generally portable between different SQL databases and the database's internal compiler or interpreter turns it into executable operations which it can then run to give the results you are expecting. There's a whole query engine in your database that looks for the optimal way to turn your SQL query into the optimal set of operations to get your results. You usually only care about what it's doing when your queries aren't running as fast as you'd hope, in the same way that you only care about your compiler when it generates slow code for your application. Now think of assembler. Assembler is unique to the processor family it runs on. These are the smallest operations the processor will let you program it with and they all run as fast as the processor can. They do exactly what they say and no more. High level language compilers convert programs into assembler (eventually) so they can be run, but writing in bare assembler can be even more efficient as long as you can take into account all the internal ""moving parts"" of the processor. The downside is that you can't move your assembler code to a different processor family. And now think of NoSQL like that. The query engine and low level operations of a database exposed through an API to give you a more intimate control of your database operations. For databases that's something like find a record by a key, update a record with that key, construct a query from a chain of operands. These small operations can be combined by applications to create powerful applications. NoSQL emerged in a world of SQL not to replace it but to allow people to experiment with new ways of working with databases and optimising databases to particular tasks. 
The same deal with assembler applies with NoSQL; you get direct control of the underlying system, you have to worry about managing that system a lot more - selecting indexes, creating reliable operations which don't crash into each other, making sure you aren't locking out other operations - these are things you will, at any scale, have to think about at some point. The good news is that NoSQL databases have matured so the underlying mechanisms are more resilient and reliable to these issues. NoSQL databases have also focused on particular data types or arrangements - JSON document, columnar storage, graphs - and on different architectures - in-memory, sharded, distributed, replicated - to create databases which are very powerful for particular use cases. SQL is a language of general purpose utility. It sets out with a relational, table centric structure and you rely on the database to make optimal decisions in interpreting your intent and coming up with the best path to get your results. Because of that SQL also shaped how the underlying databases operated and how they developed over time. To jump ship to another analogy temporarily, NoSQL is like RISC processors were to the CISC processors in the 80s and 90s. RISC processors gave chip designers a whole new way to approach problems of scale and moved the task of building optimised instruction pipelines up to the compilers used to create code for the RISC chips. Some even went as far as turning CISC instructions into RISC instructions on the fly. The two approaches often found themselves facing off over performance. Where are we now? The lessons learnt from RISC processors are embedded in CISC designs while a new class of more complex RISC chip is to be found optimized for power consumption in a billion devices - the biggest niche ever. Here's the cool part. Those billion RISC devices interoperate with all the CISC and other RISC devices out there over the internet and through the millions of servers in the cloud. It's not an either/or choice. It's a best-for-the-task selections. When you go out and buy a computer in 2016, you pick it for suitability for a task, not whether it has a RISC or CISC design philosophy at the heart of its CPU. In the same way, when picking a database, or databases, for a task you should select for suitability for that task. Which brings us back to the assembler/higher level language analogy in this analogy inception. What this analogy offers us is a simple rule of thumb for thinking about how SQL and NoSQL impact on that decision of suitability. A NoSQL database will tend to be optimized for a class of problems and it's important to understand what those problems are. SQL can always be, at least in theory, compiled into the operations of a NoSQL database and there are tools out there which will do this for you. You'll usually find them in the Business Intelligence and Analytics aisle. Some NoSQL databases are internalising the same ideas to offer subsets of SQL too, raising the bar on NoSQL's assembler-ness in this analogy to something closer to the capabilities of a high level language. Opt for NoSQL and you get handed the keys to the database, along with a specially selected set of components and the freedom to assemble them how you wish. Opt for SQL, you'll get access to an often feature rich semi-autonomous car which will take you from A to B efficiently every day. 
As a developer you'd never say ""I'll just use assembler for all my apps"" or ""I'll use only this high level languages""; you would keep all options open. The best solution? Opt for whatever is best for your task; not just one but as many as you need. If your application stack needs an in-memory database or messaging bus binding together applications using a document database for client facing applications, a database for backend analytics and a JSON document search database, then thats the architecture you should go for. That and the ability to deploy production grade versions of all those databases whenever you need them. Image by Davide Ragusa Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Dj Walker-Morgan is Compose's resident Content Curator, and has been both a developer and writer since Apples came in II flavors and Commodores had Pets. Love this article? Head over to Dj Walker-Morgan’s author page and keep reading. Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Database as a Service Customer Stories Support System Status Support Enhanced Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ Enterprise Add-ons Deployments AWS DigitalOcean SoftLayer© 2016 Compose",Here's a question we hear a lot - Should I use SQL databases or NoSQL databases? It's a question that gets asked because often underlying it is another question - What's broken in SQL databases that NoSQL databases fixes? The answer to that one is much easier: Nothing.,Picking SQL or NoSQL? – A Compose View,Live,121 309,"Homepage IBM Watson Data Lab Follow Sign in / Sign up Homepage * Home * Cognitive Computing * Data Science * Web Dev * Mark Watson Blocked Unblock Follow Following Developer Advocate, IBM Watson Data Platform Oct 18 -------------------------------------------------------------------------------- WATSON MACHINE LEARNING FOR DEVELOPERS UNDERSTANDING THE BASIC PROBLEMS AND WORKFLOW (PART 1) I am not a Data Scientist , but I am a developer interested in data science and machine learning. I hope you are here because you are as well! This is the first installment in a series of posts aimed at introducing developers like me and you to the basic machine learning concepts and tools required to get an ML system up and running. I will not be spending a lot of time talking about how to clean and analyze data, or the finer points of how machine learning works, but I will introduce you to fundamental concepts that you will need to get your first system up and running. Let’s start by understanding when and why you would use machine learning. We’ll eventually use the Watson ML service to deploy our model, but the problems and workflow I describe here apply broadly to machine learning.PREDICTIONS The ultimate goal of a machine learning system is to make a prediction. Here are some examples you may be familiar with: 1. Predict whether an image is a cat or dog 2. Predict the value of a home 3. Predict which products to recommend to a user 4. Predict which users share the same interests 5. Predict when to turn, accelerate, or apply the brakes in a self-driving car Machine learning is all about predictions. If you have a use case where you need to make predictions (and a lot of data), machine learning may be a good fit. How do ML systems make predictions? It all starts with the data. 
ML libraries and platforms can make predictions by analyzing massive amounts of data and finding patterns or mathematical formulas that "explain" the data. The data is the most crucial component of a successful ML system. You need to have a lot of it, and it has to be good. Bad data in = bad predictions out. Let's go a tad deeper to get a better understanding of how machine learning works.

DATA

It can't be said enough. It all starts with the data, and it has to be good data. Let's start with a simple, well-known machine learning example: predicting house prices. Let's say we have a data set of known houses and their associated prices:

Square Feet   # Bedrooms   Color   Price
-----------   ----------   -----   --------
2,100         3            White   $100,000
2,300         4            White   $125,000
2,500         4            Brown   $150,000

Obviously this is not a lot of data, and not good data, but ignore that for now. Our goal is to build a machine learning system to predict house prices using this data set. Predicting the price of a house is a supervised machine learning problem — that means we know the outcome for a subset of use cases (i.e., we know what the prices are for the houses listed above), and we can use those outcomes to train an ML system to predict outcomes for new use cases (i.e., predict the price for a house that is not in the list). An unsupervised ML problem is one where the system learns from the data, rather than being trained by the data. We'll cover unsupervised learning in a future post.

Specifically, this is a regression problem. A regression problem is one in which you want to predict a real number, like the price of a house. We will also cover binary and multiclass classification (when you want to predict a class or category from a predefined list of values) and clustering (when you want to group data that is similar).

When we build a supervised ML model, we need to specify which variables we want to use to make our predictions. These variables are referred to as features. We know that when a house is 2,100 square feet, has 3 bedrooms, and is the color white, then the price is $100,000. In this example, color is not important to predicting the price of a home, but you could reason that both square footage and the number of bedrooms are. So, it makes sense that we choose Square Feet and # Bedrooms as our features. The value we want to predict is the Price. This is referred to as our label. We'll use the features in our data set to build a model that can predict the label (Price). That process looks a little like this (a toy sketch follows the list):

1. Choose an ML algorithm. We'll cover some of the common algorithms used in machine learning.
2. Instruct our ML algorithm to use Square Feet and # Bedrooms as our features and Price as our label (the value we want to predict).
3. Feed the data set to our ML algorithm to train an ML model that can make predictions. The algorithm will use the data set that you feed it to come up with a mathematical formula for predicting new outcomes.
4. To predict a price, we feed our ML model a set of features (square footage and number of bedrooms) and in response receive a predicted price.
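To make those four steps concrete, here is a minimal sketch of the idea using Spark ML's linear regression from Python (PySpark). This is not the article's actual code (part two builds the real model); the column names and the 2,400-square-foot example house are made up for illustration.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("house-prices").getOrCreate()

# The toy data set from the table above: two feature columns and a label column
houses = spark.createDataFrame(
    [(2100, 3, 100000.0), (2300, 4, 125000.0), (2500, 4, 150000.0)],
    ["SquareFeet", "Bedrooms", "Price"])

# Step 2: tell the algorithm which columns are the features
assembler = VectorAssembler(inputCols=["SquareFeet", "Bedrooms"], outputCol="features")
train_data = assembler.transform(houses)

# Steps 1 and 3: choose an algorithm and train a model, using Price as the label
lr = LinearRegression(featuresCol="features", labelCol="Price")
model = lr.fit(train_data)

# Step 4: feed the model a new set of features and get back a predicted price
new_house = assembler.transform(
    spark.createDataFrame([(2400, 4)], ["SquareFeet", "Bedrooms"]))
model.transform(new_house).select("SquareFeet", "Bedrooms", "prediction").show()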
Now that we have data, and I've outlined the general steps for getting from the data to a prediction, let's see what tools can help us get there.

TOOLS

We'll focus on the tools provided by the IBM Data Science Experience (DSX). Many of the tools are open source and can be run locally or on other platforms, and the general concepts should apply to other hosted machine learning offerings.

Jupyter Notebooks: Notebooks are used by data scientists to clean, visualize, and understand data. DSX uses Jupyter Notebooks, but notebooks come in different flavors. In DSX you code your notebooks in Python or Scala.

Apache Spark™: Spark is a cluster computing platform for analyzing massive amounts of data in-memory. For machine learning to be effective, you need lots of data, so it only makes sense that you have a platform like Spark to help.

Apache Spark ML: Spark ML is a library for building ML pipelines on top of Apache Spark. Spark ML includes algorithms and APIs for supervised and unsupervised machine learning problems.

IBM Watson ML: Watson ML is a service for deploying ML models and making predictions at runtime. Watson ML provides a REST API to your ML models which can be called directly from your application or your middleware.

Let's see how all these tools work together.

WORKFLOW

Here is the typical path I take when building and hosting a machine learning model (a sketch of the final scoring step follows the list):

1. Identify a prediction you want to make and the data set that can help you make it.
2. Create a Jupyter Notebook and import, clean, and analyze the data.
3. Use Apache Spark ML to build and test a machine learning model.
4. Deploy the model to Watson ML.
5. Call the Watson ML scoring endpoint (REST API) to make predictions from a client application or backend service.

This path works for supervised and unsupervised machine learning, and I'll use it to show you how you can solve regression, classification, and clustering ML problems.
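To give a flavor of step 5, here is a rough sketch of what calling a deployed model's scoring endpoint could look like from Python. The host name, the path with its instance, model, and deployment IDs, the token step, and the payload shape are all placeholders and assumptions rather than the article's actual code; the exact details depend on your Watson ML service instance and API version, and part two walks through the real thing.

import requests

# Illustrative values only: substitute the credentials and URLs from your own
# Watson ML service instance.
scoring_url = ("https://ibm-watson-ml.mybluemix.net/v3/wml_instances/INSTANCE_ID/"
               "published_models/MODEL_ID/deployments/DEPLOYMENT_ID/online")
token = "..."  # an access token obtained with your Watson ML service credentials

# One row of feature values to score, using the same features the model was trained on
payload = {
    "fields": ["SquareFeet", "Bedrooms"],
    "values": [[2400, 4]]
}

response = requests.post(
    scoring_url,
    json=payload,
    headers={"Authorization": "Bearer " + token})

print(response.json())  # the response contains the model's prediction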
NEXT STEPS

In this post, I gave an overview of what you can use machine learning for, a tool chain that you can use to build end-to-end ML systems, and the path I follow to build them. In part two, we'll follow this path to build an ML system to predict housing prices. I'll show you how to get from a raw data set to a REST API with just a few lines of code.","This is the first installment in a series of posts aimed at introducing developers like me and you to the basic machine learning concepts and tools required to get an ML system up and running.","Watson Machine Learning for Developers",Live,122 312,"WHAT'S ALL THE HOOPLA ABOUT GRAPH DATABASES? Lauren Schaefer / October 7, 2016. When it's time to choose the database technology for your app, the choices can be overwhelming. Should you choose SQL or NoSQL? Open source or proprietary? Self-hosted or hosted? If you're not already familiar with graph databases, you might be tempted to ignore them as an option. But that could be a mistake. Here's why: If you want to try a graph database, getting started can get very complicated. Check out my latest video that shows you how to quickly and easily try a graph database: Happy graphing!","When it's time to choose the database technology for your app, the choices can be overwhelming. Here's why you should consider graph databases.",What's all the hoopla about graph databases?,Live,123 315,"Karlijn Willems, Data Science Journalist @DataCamp

PYTHON MACHINE LEARNING: SCIKIT-LEARN TUTORIAL

Originally published at https://www.datacamp.com/community/tutorials/machine-learning-python

Machine learning studies the design of algorithms that can learn. The hope this discipline brings with it is that including experience in its tasks will eventually improve the learning; the ultimate goal is for that improvement to happen in such a way that the learning itself becomes automatic, so that humans no longer need to interfere. You'll probably have already heard that machine learning has close ties to Knowledge Discovery, Data Mining, Artificial Intelligence (AI) and Statistics. Typical use cases of machine learning range from scientific knowledge discovery to more commercial ones: from the "Robot Scientist" to anti-spam filtering and recommender systems. Or maybe, if you haven't heard about this discipline, you'll find it vaguely familiar as one of the 8 topics that you need to master if you want to excel in data science.

This scikit-learn tutorial will introduce you to the basics of Python machine learning: step by step, it will show you how to use Python and its libraries to explore your data with the help of matplotlib, work with the well-known algorithms KMeans and Support Vector Machines (SVM) to construct models, fit the data to these models, predict values, and validate the models that you have built. Note that the code chunks have been left out for convenience. If you want to follow and practice with code, go here. If you're more interested in an R tutorial, check out our Machine Learning with R for Beginners tutorial.

LOADING YOUR DATA

The first step to just about anything in data science is loading in your data.
This is also the starting point of this tutorial. If you’re new to this and you want to start problems on your own, finding data sets might prove to be a challenge. However, you can typically find good data sets at the UCI Machine Learning Repository or on the Kaggle website. Also, check out this KD Nuggets list with resources . For now, you just load in the digits dataset that comes with a Python library, called scikit-learn . No need to go and look for datasets yourself. Fun fact: did you know the name originates from the fact that this library is a scientific toolbox built around SciPy? By the way, there is more than just one scikit out there. This scikit contains modules specifically for machine learning and data mining, which explains the second component of the library name. :) To load in the data, you import the module datasets from sklearn . Then, you can use the load_digits() method from datasets to load in the data. Note that the datasets module contains other methods to load and fetch popular reference datasets, and you can also count on this module in case you need artificial data generators. In addition, this data set is also available through the UCI Repository that was mentioned above: you can find the data here . You’ll load in this data with the help of the pandas library. When you first start working with a dataset, it’s always a good idea to go through the data description and see what you can already learn. When it comes to scikit-learn , you don’t immediately have this information readily available, but in the case where you import data from another source, there's usually a data description present, which will already be a sufficient amount of information to gather some insights into your data. However, these insights are not merely deep enough for the analysis that you are going to perform. You really need to have a good working knowledge about the data set. Performing an exploratory data analysis (EDA) on a data set like the one that this tutorial now has might seem difficult. You should start with gathering the basic information: you already have knowledge of things such as the target values and the description of your data. You can access the digits data through the attribute data . Similarly, you can also access the target values or labels through the target attribute and the description through the DESCR attribute. To see which keys you have available to already get to know your data, you can just run digits.keys() . The next thing that you can (double)check is the type of your data. If you used read_csv() to import the data, you would have had a data frame that contains just the data. There wouldn’t be any description component, but you would be able to resort to, for example, head() or tail() to inspect your data. In these cases, it’s always wise to read up on the data description folder! However, this tutorial assumes that you make use of the library’s data and the type of the digits variable is not that straightforward if you’re not familiar with the library. Look at the print out in the first code chunk. You’ll see that digits actually contains numpy arrays! This is already quite some important information. But how do you access these arays? It’s very easy, actually: you use attributes to access the relevant arrays. Remember that you have already seen which attributes are available when you printed digits.keys() . 
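Since the tutorial's code chunks were left out, here is a minimal sketch of the loading step just described, assuming only that scikit-learn is installed; it is a reconstruction for illustration rather than the tutorial's exact code.

# Load the built-in digits data set and see what it contains.
from sklearn import datasets

digits = datasets.load_digits()

# Which keys/attributes are available to explore?
print(digits.keys())

# The data behind those attributes is stored in numpy arrays.
print(type(digits.data))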
For instance, you have the data attribute to isolate the data, target to see the target values and the DESCR for the description, … But what then? The first thing that you should know of an array is its shape. That is, the number of dimensions and items that is contained within an array. The array’s shape is a tuple of integers that specify the sizes of each dimension. Now let’s try to see what the shape is of these three arrays that you have distinguished (the data , target and DESCR arrays). Use first the data attribute to isolate the numpy array from the digits data and then use the shape attribute to find out more. You can do the same for the target and DESCR . There’s also the images attribute, which is basically the data in images. To recap: by inspecting digits.data , you see that there are 1797 samples and that there are 64 features. Because you have 1797 samples, you also have 1797 target values. But all those target values contain 10 unique values, namely, from 0 to 9. In other words, all 1797 target values are made up of numbers that lie between 0 and 9. This means that the digits that your model will need to recognize are numbers from 0 to 9. Lastly, you see that the images data contains three dimensions: there are 1797 instances that are 8 by 8 pixels big. Then, you can take your exploration up a notch by visualizing the images that you’ll be working with. You can use one of Python’s data visualization libraries, such as matplotlib : On a more simple note, you can also visualize the target labels with an image: Now you know a very good idea of the data that you’ll be working with! But is there no other way to visualize the data? As the digits data set contains 64 features, this might prove to be a challenging task. You can imagine that it’s very hard to understand the structure and keep the overview of the digits data. In such cases, it is said that you’re working with a high dimensional data set. High dimensionality of data is a direct result of trying to describe the objects via a collection of features. Other examples of high dimensional data are, for example, financial data, climate data, neuroimaging, … But, as you might have gathered already, this is not always easy. In some cases, high dimensionality can be problematic, as your algorithms will need to take into account too many features. In such cases, you speak of the curse of dimensionality. Because having a lot of dimensions can also mean that your data points are far away from virtually every other point, which makes the distances between the data points uninformative. Dont’ worry, though, because the curse of dimensionality is not simply a matter of counting the number of features. There are also cases in which the effective dimensionality might be much smaller than the number of the features, such as in data sets where some features are irrelevant. In addition, you can also understand that data with only two or three dimensions is easier to grasp and can also be visualized easily. That all explains why you’re going to visualize the data with the help of one of the Dimensionality Reduction techniques, namely Principal Component Analysis (PCA). The idea in PCA is to find a linear combination of the two variables that contains most of the information. This new variable or “principal component” can replace the two original variables. In short, it’s a linear transformation method that yields the directions (principal components) that maximize the variance of the data. 
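Before moving on to the PCA discussion that continues below, here is a rough sketch of the shape checks and the matplotlib image plot described above. The figure size and the number of images shown are arbitrary choices of mine, not the tutorial's.

from sklearn import datasets
import matplotlib.pyplot as plt

digits = datasets.load_digits()

# Shape checks: 1797 samples with 64 features, 1797 target labels,
# and 1797 images of 8 x 8 pixels.
print(digits.data.shape)    # (1797, 64)
print(digits.target.shape)  # (1797,)
print(digits.images.shape)  # (1797, 8, 8)

# Show a handful of the images together with their target labels.
fig, axes = plt.subplots(1, 6, figsize=(9, 2))
for ax, image, label in zip(axes, digits.images, digits.target):
    ax.imshow(image, cmap='gray_r')
    ax.set_title(label)
    ax.axis('off')
plt.show()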
Remember that the variance indicates how far a set of data points lie apart. If you want to know more, go to this page . You can easily apply PCA do your data with the help of scikit-learn. Tip : you have used the RandomizedPCA() here because it performs better when there’s a high number of dimensions. Try replacing the randomized PCA model or estimator object with a regular PCA model and see what the difference is. Note how you explicitly tell the model to only keep two components. This is to make sure that you have two-dimensional data to plot. Also, note that you don’t pass the target class with the labels to the PCA transformation because you want to investigate if the PCA reveals the distribution of the different labels and if you can clearly separate the instances from each other. You can now build a scatterplot to visualize the data: Again you use matplotlib to visualize the data. It’s good for a quick visualization of what you’re working with, but you might have to consider something a little bit more fancy if you’re working on making this part of your data science portfolio. Also note that the last call to show the plot ( plt.show() ) is not necessary if you’re working in Jupyter Notebook, as you’ll want to put the images inline. When in doubt, you can always check out our Definitive Guide to Jupyter Notebook . WHERE TO GO NOW? Now that you have even more information about your data and you have a visualization ready, it does seem a bit like the data points sort of group together, but you also see there is quite some overlap. This might be interesting to investigate further. Do you think that, in a case where you knew that there are 10 possible digits labels to assign to the data points, but you have no access to the labels, the observations would group or “cluster” together by some criterion in such a way that you could infer the lables? Now this is a research question! In general, when you have acquired a good understanding of your data, you have to decide on the use cases that would be relevant to your data set. In other words, you think about what your data set might teach you or what you think you can learn from your data. From there on, you can think about what kind of algorithms you would be able to apply to your data set in order to get the results that you think you can obtain. Tip: the more familiar you are with your data, the easier it will be to assess the use cases for your specific data set. The same also holds for finding the appropriate machine algorithm. However, when you’re first getting started with scikit-learn , you’ll see that the amount of algorithms that the library contains is pretty vast and that you might still want additional help when you’re doing the assessment for your data set. That’s why this scikit-learn machine learning map will come in handy. Note that this map does require you to have some knowledge about the algorithms that are included in the scikit-learn library. This, by the way, also holds some truth for taking this next step in your project: if you have no idea what is possible, it will be very hard to decide on what your use case will be for the data. As your use case was one for clustering, you can follow the path on the map towards “KMeans”. You’ll see the use case that you have just thought about requires you to have more than 50 samples (“check!”), to have labeled data (“check!”), to know the number of categories that you want to predict (“check!”) and to have less than 10K samples (“check!”). 
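Before the K-Means walkthrough that follows, here is a sketch of the PCA projection and scatterplot described above. Note that the RandomizedPCA estimator mentioned in the text has since been folded into PCA in newer scikit-learn releases, so this sketch uses PCA with svd_solver='randomized'; that substitution is mine, not the tutorial's.

from sklearn import datasets
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

digits = datasets.load_digits()

# Keep two components so the data can be plotted in 2D.
pca = PCA(n_components=2, svd_solver='randomized')
reduced_data = pca.fit_transform(digits.data)

# Scatter the projected points, coloured by their true digit label.
plt.scatter(reduced_data[:, 0], reduced_data[:, 1],
            c=digits.target, cmap='tab10', s=10)
plt.colorbar(label='digit label')
plt.xlabel('first principal component')
plt.ylabel('second principal component')
plt.show()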
But what exactly is the K-Means algorithm? It is one of the simplest and widely used unsupervised learning algorithms to solve clustering problems. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters that you have set before you run the algorithm. This number of clusters is called k and you select this number at random. Then, the k-means algorithm will find the nearest cluster center for each data point and assign the data point closest to that cluster. Once all data points have been assigned to clusters, the cluster centers will be recomputed. In other words, new cluster centers will emerge from the average of the values of the cluster data points. This process is repeated until most data points stick to the same cluster. The cluster membership should stabilize. You can already see that, because the k-means algorithm works the way it does, the initial set of cluster centers that you give up can have a big effect on the clusters that are eventually found. You can, of course, deal with this effect, as you will see further on. However, before you can go into making a model for your data, you should definitely take a look into preparing your data for this purpose. As you have read in the previous section, before modeling your data, you’ll do well by preparing it first. This preparation step is called “preprocessing”. The first thing that we’re going to do is preprocessing the data. You can standardize the digits data by, for example, making use of the scale() method. By scaling the data, you shift the distribution of each attribute to have a mean of zero and a standard deviation of one (unit variance). In order to assess your model’s performance later, you will also need to divide the data set into two parts: a training set and a test set. The first is used to train the system, while the second is used to evaluate the learned or trained system. In practice, the division of your data set into a test and a training sets is disjoint: the most common splitting choice is to take 2/3 of your original data set as the training set, while the 1/3 that remains will compose the test set. You will try to do this also here. You see in the code chunk below that this ‘traditional’ splitting choice is respected: in the arguments of the train_test_split() method, you clearly see that the test_size is set to 0.25 . You’ll also note that the argument random_state has the value 42 assigned to it. With this argument, you can guarantee that your split will always be the same. That is particularly handy if you want reproducible results. After you have split up your data set into train and test sets, you can quickly inspect the numbers before you go and model the data: You’ll see that the training set X_train now contains 1347 samples, which is exactly 2/3d of the samples that the original data set contained, and 64 features, which hasn’t changed. The y_train training set also contains 2/3d of the labels of the original data set. This means that the test sets X_train and y_train contain 450 samples. After all these preparation steps, you have made sure that all your known (training) data is stored. No actual model or learning was performed up until this moment. Now, it’s finally time to find those clusters of your training set. Use KMeans() from the cluster module to set up your model. You’ll see that there are three arguments that are passed to this method: init , n_clusters and the random_state . 
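The next paragraphs unpack those three arguments. As a reference point, here is a sketch of the preprocessing, splitting, and K-Means setup just described; the variable names mirror the tutorial's, but the snippet itself is a reconstruction under the stated 25% test split.

from sklearn import cluster, datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

digits = datasets.load_digits()

# Standardize each feature to zero mean and unit variance.
data = scale(digits.data)

# Hold out 25% of the samples as a test set; random_state makes the
# split reproducible from run to run.
X_train, X_test, y_train, y_test, images_train, images_test = train_test_split(
    data, digits.target, digits.images, test_size=0.25, random_state=42)

# Set up the K-Means estimator with the three arguments discussed next.
clf = cluster.KMeans(init='k-means++', n_clusters=10, random_state=42)
clf.fit(X_train)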
You might still remember this last argument from before when you split the data into training and test sets. This argument basically guaranteed that you got reproducible results. The init indicates the method for initialization and even though it defaults to ‘k-means++’ , you see it explicitly coming back in the code. That means that you can leave it out if you want. Try it out in the DataCamp Light chunk above! Next, you also see that the n_clusters argument is set to 10 . This number not only indicates the number of clusters or groups you want your data to form, but also the number of centroids to generate. Remember that a cluster centroid is the middle of a cluster. Do you also still remember how the previous section described this as one of the possible disadvantages of the K-Means algorithm? That is, that the initial set of cluster centers that you give up can have a big effect on the clusters that are eventually found? Usually, you try to deal with this effect by trying several initial sets in multiple runs and by selecting the set of clusters with the minimum sum of the squared errors (SSE). In other words, you want to minimize the distance of each point in the cluster to the mean or centroid of that cluster. By adding the n-init argument to KMeans() , you can determine how many different centroid configurations the algorithm will try. Note again that you don’t want to insert the test labels when you fit the model to your data: these will be used to see if your model is good at predicting the actual classes of your instances! You can also visualize the images that make up the cluster centers: If you want to see another example that visualizes the data clusters and their centers, go here . The next step is to predict the labels of the test set. You predict the values for the test set, which contains 450 samples. You store the result in y_pred . You also print out the first 100 instances of y_pred and y_test and you immediately see some results. In addition, you can study the shape of the cluster centers: you immediately see that there are 10 clusters with each 64 features. But this doesn’t tell you much because we set the number of clusters to 10 and you already knew that there were 64 features. Maybe a visualization would be more helpful: Tip : run the code from above again, but use the PCA reduction method: At first sight, the visualization doesn’t seem to indicate that the model works well. This needs some further investigation. And this need for further investigation brings you to the next essential step, which is the evaluation of your model’s performance. In other words, you want to analyze the degree of correctness of the model’s predictions. You should look at the confusion matrix. Then, you should try to figure out something more about the quality of the clusters by applying different cluster quality metrics. That way, you can judge the goodness of fit of the cluster labels to the correct labels. There are quite some metrics to consider: * The homogeneity score * The completeness score * The V-measure score * The adjusted Rand score * The Adjusted Mutual Info (AMI) score * The silhouette score But also these scores aren’t fantastic. Clearly, you should consider another estimator to predict the labels for the digits data. When you recapped all of the information that you gathered out of the data exploration, you saw that you could build a model to predict which group a digit belongs to without you knowing the labels. 
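Before the classifier comparison continues below, here is what the evaluation step described above might look like in code: predicted cluster labels for the held-out digits, a confusion matrix, and the cluster-quality metrics listed. It repeats the earlier setup so it runs on its own, and it is a reconstruction rather than the tutorial's exact code.

from sklearn import cluster, datasets, metrics
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

digits = datasets.load_digits()
data = scale(digits.data)
X_train, X_test, y_train, y_test = train_test_split(
    data, digits.target, test_size=0.25, random_state=42)

# Fit the clustering model on the training data only.
clf = cluster.KMeans(init='k-means++', n_clusters=10, random_state=42)
clf.fit(X_train)

# Predict cluster labels for the test set and compare with the true digits.
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))

# The cluster-quality metrics mentioned in the text.
print('homogeneity   %.3f' % metrics.homogeneity_score(y_test, y_pred))
print('completeness  %.3f' % metrics.completeness_score(y_test, y_pred))
print('v-measure     %.3f' % metrics.v_measure_score(y_test, y_pred))
print('adjusted Rand %.3f' % metrics.adjusted_rand_score(y_test, y_pred))
print('AMI           %.3f' % metrics.adjusted_mutual_info_score(y_test, y_pred))
print('silhouette    %.3f' % metrics.silhouette_score(X_test, y_pred))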
And indeed, you just used the training data and not the target values to build your KMeans model. Let’s assume that you depart from the case where you use both the digits training data and the corresponding target values to build your model. If you follow the algorithm map, you’ll see that the first model that you meet is the linear SVC. Let’s apply this to our data. You see here that you make use of X_train and y_train to fit the data to the SVC model. This is clearly different from clustering. Note also that in this example, you set the value of gamma manually. It is possible to automatically find good values for the parameters by using tools such as grid search and cross validation. Even though this is not the focus of this tutorial, you will see how you could have gone about this if you would have made use of grid search to adjust your parameters. For a walkthrough on how you should apply grid search, I refer you to the original tutorial . You see that in the SVM classifier has a kernel argument that specifies the kernel type that you’re going to use in the algorithm. By default, this is rbf . In other cases, you can specify others such as linear , poly , … But what is a kernel exactly? A kernel is a similarity function, which is used to compute similarity between the training data points. When you provide a kernel to an algorithm, together with the training data and the labels, you will get a classifier, as is the case here. You will have trained a model that assigns new unseen objects into a particular category. For the SVM, you will typicall try to linearly divide your data points. You can now visualize the images and their predicted labels. This plot is very similar to the plot that you made when you were exploring the data: But now the biggest question: how does this model perform? You clearly see that this model performs a whole lot better than the clustering model that you used earlier. You can also see it when you visualize the predicted and the actual labels: You’ll see that this visualization confirms your classification report, which is very good news. :) WHAT’S NEXT IN YOUR DATA SCIENCE JOURNEY? Congratulations, you have reached the end of this scikit-learn tutorial, which was meant to introduce you to Python machine learning! Now it’s your turn. Start your own digit recognition project with different data. One dataset that you can already use is the MNIST data, which you can download here . The steps that you will need to take are very similar to the ones that you have gone through with this tutorial, but if you still feel that you can use some help, you should check out this page , which works with the MNIST data and applies the KMeans algorithm. Working with the digits dataset was the first step in classifying characters with scikit-learn . If you’re done with this, you might consider trying out an even more challenging problem, namely, classifying alphanumeric characters in natural images. A well-known dataset that you can use for this problem is the Chars74K dataset, which contains more than 74,000 images of digits from 0 to 9 and the both lowercase and higher case letters of the English alphabet. You can download the dataset here . Whether you’re going to start with the projects that have been mentioned above or not, this is definitely not the end of your journey of data science with Python. 
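As a companion to the SVC discussion above, and before the closing pointers that follow, here is a sketch of fitting the classifier and printing a classification report. The gamma=0.001 value is a common manual choice for the digits data, but it is my assumption here rather than a value taken from the tutorial.

from sklearn import datasets, svm
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=42)

# Fit a support vector classifier with a manually chosen gamma
# (grid search / cross-validation could tune this instead).
svc_model = svm.SVC(gamma=0.001, kernel='rbf')
svc_model.fit(X_train, y_train)

# How well does it do on the held-out digits?
predicted = svc_model.predict(X_test)
print(classification_report(y_test, predicted))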
If you choose not to widen your view just yet, consider deepening your data visualization and data manipulation knowledge: don’t miss out on DataCamp’s Interactive Data Visualization with Bokeh course to make sure you can impress your peers with a stunning data science portfolio, or DataCamp’s pandas Foundation course to learn more about working with data frames in Python. -------------------------------------------------------------------------------- Originally published at www.datacamp.com . Data Science Machine Learning Python Scikit Learn KARLIJN WILLEMS Data Science Journalist @DataCamp",Machine learning studies the design of algorithms that can learn. The hope that this discipline brings with itself is that the inclusion of experience into its tasks will eventually improve the…,Python Machine Learning: Scikit-Learn Tutorial,Live,124 316,"Christopher Roach STATISTICS FOR HACKERS 18 January 2017 | Statistics MOTIVATION ¶ There's no shortage of absolutely magnificent material out there on the topics of data science and machine learning for an autodidact, such as myself, to learn from. In fact, so many great resources exist that an individual can be forgiven for not knowing where to begin their studies, or for getting distracted once they're off the starting block. I honestly can't count the number of times that I've started working through many of these online courses and tutorials only to have my attention stolen by one of the multitudes of amazing articles on data analysis with Python, or some great new MOOC on Deep Learning. But this year is different! This year, for one of my new year's resolutions, I've decided to create a personalized data science curriculum and stick to it. This year, I promise not to just casually sign up for another course, or start reading yet another textbook to be distracted part way through. This year, I'm sticking to the plan. As part of my personalized program of study, I've chosen to start with Harvard's Data Science course . I'm currently on week 3 and one of the suggested readings for this week is Jake VanderPlas' talk from PyCon 2016 titled ""Statistics for Hackers"". As I was watching the video and following along with the slides , I wanted to try out some of the examples and create a set of notes that I could refer to later, so I figured why not create a Jupyter notebook. Once I'd finished, I realized I'd created a decently-sized resource that could be of use to others working their way through the talk. The result is the article you're reading right now, the remainder of which contains my notes and code examples for Jake's excellent talk. So, enjoy the article, I hope you find this resource useful, and if you have any problems or suggestions of any kind, the full notebook can be found on github , so please send me a pull request , or submit an issue , or just message me directly on Twitter . PRELIMINARIES ¶ In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Suppress all warnings just to keep the notebook nice and clean.
# This must happen after all imports since numpy actually adds its# RankWarning class back in.importwarningswarnings.filterwarnings(""ignore"")# Setup the look and feel of the notebooksns.set_context(""notebook"",font_scale=1.5,rc={""lines.linewidth"":2.5})sns.set_style('whitegrid')sns.set_palette('deep')# Create a couple of colors to use throughout the notebookred=sns.xkcd_rgb['vermillion']blue=sns.xkcd_rgb['dark sky blue']fromIPython.displayimportdisplay%matplotlib inline %config InlineBackend.figure_format = 'retina' WARM-UP ¶ The talk starts off with a motivating example that asks the question ""If you toss a coin 30 times and see 22 heads, is it a fair coin?"" We all know that a fair coin should come up heads roughly 15 out of 30 tosses, give or take, so it does seem unlikely to see so many heads. However, the skeptic might argue that even a fair coin could show 22 heads in 30 tosses from time-to-time. This could just be a chance event. So, the question would then be ""how can you determine if you're tossing a fair coin?"" THE CLASSIC METHOD ¶ The classic method would assume that the skeptic is correct and would then test the hypothesis (i.e., the Null Hypothesis ) that the observation of 22 heads in 30 tosses could happen simply by chance. Let's start by first considering the probability of a single coin flip coming up heads and work our way up to 22 out of 30. $$ P(H) = \frac{1}{2} $$As our equation shows, the probability of a single coin toss turning up heads is exactly 50% since there is an equal chance of either heads or tails turning up. Taking this one step further, to determine the probability of getting 2 heads in a row with 2 coin tosses, we would need to multiply the probability of getting heads by the probability of getting heads again since the two events are independent of one another. $$ P(HH) = P(H) \cdot P(H) = P(H)^2 = \left(\frac{1}{2}\right)^2 = \frac{1}{4} $$From the equation above, we can see that the probability of getting 2 heads in a row from a total of 2 coin tosses is 25%. Let's now take a look at a slightly different scenario and calculate the probability of getting 2 heads and 1 tails with 3 coin tosses. $$ P(HHT) = P(H)^2 \cdot P(T) = \left(\frac{1}{2}\right)^2 \cdot \frac{1}{2} = \left(\frac{1}{2}\right)^3 = \frac{1}{8} $$The equation above tells us that the probability of getting 2 heads and 1 tails in 3 tosses is 12.5%. This is actually the exact same probability as getting heads in all three tosses, which doesn't sound quite right. The problem is that we've only calculated the probability for a single permutation of 2 heads and 1 tails; specifically for the scenario where we only see tails on the third toss. To get the actual probability of tossing 2 heads and 1 tails we will have to add the probabilities for all of the possible permutations, of which there are exactly three: HHT, HTH, and THH. $$ P(2H,1T) = P(HHT) + P(HTH) + P(THH) = \frac{1}{8} + \frac{1}{8} + \frac{1}{8} = \frac{3}{8} $$Another way we could do this is to calculate the total number of permutations and simply multiply that by the probability of each event happening. To get the total number of permutations we can use the binomial coefficient . Then, we can simply calculate the probability above using the following equation. 
$$ P(2H,1T) = \binom{3}{2} \left(\frac{1}{2}\right)^{3} = 3 \left(\frac{1}{8}\right) = \frac{3}{8} $$While the equation above works in our particular case, where each event has an equal probability of happening, it will run into trouble with events that have an unequal chance of taking place. To deal with those situations, you'll want to extend the last equation to take into account the differing probabilities. The result would be the following equation, where $N$ is the number of coin flips, $N_H$ is the number of expected heads, $N_T$ is the number of expected tails, and $P_H$ is the probability of getting heads on each flip. $$ P(N_H,N_T) = \binom{N}{N_H} \left(P_H\right)^{N_H} \left(1 - P_H\right)^{N_T} $$Now that we understand the classic method, let's use it to test our null hypothesis that we are actually tossing a fair coin, and that this is just a chance occurrence. The following code implements the equations we've just discussed above. In [2]:
def factorial(n):
    """"""Calculates the factorial of `n` """"""
    vals = list(range(1, n + 1))
    if len(vals) <= 0:
        return 1
    prod = 1
    for val in vals:
        prod *= val
    return prod

def n_choose_k(n, k):
    """"""Calculates the binomial coefficient """"""
    return factorial(n) / (factorial(k) * factorial(n - k))

def binom_prob(n, k, p):
    """"""Returns the probability of seeing `k` heads in `n` coin tosses

    Arguments:
    n - number of trials
    k - number of trials in which an event took place
    p - probability of an event happening
    """"""
    return n_choose_k(n, k) * p**k * (1 - p)**(n - k)

Now that we have a method that will calculate the probability for a specific event happening (e.g., 22 heads in 30 coin tosses), we can calculate the probability for every possible outcome of flipping a coin 30 times, and if we plot these values we'll get a visual representation of our coin's probability distribution. In [3]:
# Calculate the probability for every possible outcome of tossing
# a fair coin 30 times.
probabilities = [binom_prob(30, k, 0.5) for k in range(1, 31)]

# Plot the probability distribution using the probabilities list
# we created above.
plt.step(range(1, 31), probabilities, where='mid', color=blue)
plt.xlabel('number of heads')
plt.ylabel('probability')
plt.plot((22, 22), (0, 0.1599), color=red)
plt.annotate('0.8%',
             xytext=(25, 0.08),
             xy=(22, 0.08),
             multialignment='right',
             va='center',
             color=red,
             size='large',
             arrowprops={'arrowstyle': '-|>', 'lw': 2,
                         'color': red, 'shrinkA': 10});

The visualization above shows the probability distribution for flipping a fair coin 30 times. Using this visualization we can now determine the probability of getting, say for example, 12 heads in 30 flips, which looks to be about 8%. Notice that we've labeled our example of 22 heads as 0.8%. If we look at the probability of flipping exactly 22 heads, it looks to be a little less than 0.8%; in fact, if we calculate it using the binom_prob function from above, we get 0.5%. In [4]:
print(""Probability of flipping 22 heads: %0.1f%%"" % (binom_prob(30, 22, 0.5) * 100))

Probability of flipping 22 heads: 0.5% So, then why do we have 0.8% labeled in our probability distribution above? Well, that's because we are showing the probability of getting at least 22 heads, which is also known as the p-value. WHAT'S A P-VALUE? ¶ In statistical hypothesis testing we have an idea that we want to test, but considering that it's very hard to prove something to be true beyond doubt, rather than test our hypothesis directly, we formulate a competing hypothesis, called a null hypothesis , and then try to disprove it instead.
The null hypothesis essentially assumes that the effect we're seeing in the data could just be due to chance. In our example, the null hypothesis assumes we have a fair coin, and the way we determine if this hypothesis is true or not is by calculating how often flipping this fair coin 30 times would result in 22 or more heads. If we then take the number of times that we got 22 or more heads and divide that number by the total of all possible permutations of 30 coin tosses, we get the probability of tossing 22 or more heads with a fair coin. This probability is what we call the p-value . The p-value is used to check the validity of the null hypothesis. The way this is done is by agreeing upon some predetermined upper limit for our p-value, below which we will assume that our null hypothesis is false. In other words, if our null hypothesis were true, and 22 heads in 30 flips could happen often enough by chance, we would expect to see it happen more often than the given threshold percentage of times. So, for example, if we chose 10% as our threshold, then we would expect to see 22 or more heads show up at least 10% of the time to determine that this is a chance occurrence and not due to some bias in the coin. Historically, the generally accepted threshold has been 5%, and so if our p-value is less than 5%, we can then make the assumption that our coin may not be fair. The binom_prob function from above calculates the probability of a single event happening, so now all we need for calculating our p-value is a function that adds up the probabilities of a given event, or a more extreme event, happening. So, as an example, we would need a function to add up the probabilities of getting 22 heads, 23 heads, 24 heads, and so on. The next bit of code creates that function and uses it to calculate our p-value. In [5]:
def p_value(n, k, p):
    """"""Returns the p-value for seeing at least `k` events in `n` trials """"""
    return sum(binom_prob(n, i, p) for i in range(k, n + 1))

print(""P-value: %0.1f%%"" % (p_value(30, 22, 0.5) * 100))

P-value: 0.8% Running the code above gives us a p-value of roughly 0.8%, which matches the value in our probability distribution above and is also less than the 5% threshold needed to reject our null hypothesis, so it does look like we may have a biased coin. THE EASIER METHOD ¶ That's an example of using the classic method for testing if our coin is fair or not. However, if you don't happen to have at least some background in statistics, it can be a little hard to follow at times, but luckily for us, there's an easier method... Simulation! The code below seeks to answer the same question of whether or not our coin is fair by running a large number of simulated coin flips and calculating the proportion of these experiments that resulted in 22 heads or more. In [6]:
M = 0
n = 50000
for i in range(n):
    trials = np.random.randint(2, size=30)
    if trials.sum() >= 22:
        M += 1
p = M / n

print(""Simulated P-value: %0.1f%%"" % (p * 100))

Simulated P-value: 0.8% The result of our simulations is 0.8%, the exact same result we got earlier when we calculated the p-value using the classical method above. So, it definitely looks like it's possible that we have a biased coin since the chances of seeing 22 or more heads in 30 tosses of a fair coin are less than 1%. FOUR RECIPES FOR HACKING STATISTICS ¶ We've just seen one example of how our hacking skills can make it easy for us to answer questions that typically only a statistician would be able to answer using the classical methods of statistical analysis.
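As a short aside before moving on to the remaining recipes: the hand-rolled binom_prob and p_value functions above can be cross-checked against scipy.stats, assuming SciPy is available alongside the notebook's other imports. This check is my addition, not part of the original notebook.

from scipy import stats

# Probability of exactly 22 heads in 30 tosses of a fair coin.
print('P(22 heads): %0.1f%%' % (stats.binom.pmf(22, 30, 0.5) * 100))

# Probability of 22 or more heads, i.e. the p-value computed above.
# sf(21, ...) is the survival function P(X > 21), which equals P(X >= 22).
print('p-value:     %0.1f%%' % (stats.binom.sf(21, 30, 0.5) * 100))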
This is just one possible method for answering statistical questions using our coding skills, but Jake's talk describes four recipes in total for ""hacking statistics"", each of which is listed below. The rest of this article will go into each of the remaining techniques in some detail. 1. Direct Simulation 2. Shuffling 3. Bootstrapping 4. Cross Validation In the Warm-up section above, we saw an example direct simulation, the first recipe in our tour of statistical hacks. The next example uses the Shuffling method to figure out if there's a statistically significant difference between two different sample populations. SHUFFLING ¶ In this example, we look at the Dr. Seuss story about the Star-belly Sneetches. In this Seussian world, a group of creatures called the Sneetches are divided into two groups: those with stars on their bellies, and those with no ""stars upon thars"". Over time, the star-bellied sneetches have come to think of themselves as better than the plain-bellied sneetches. As researchers of sneetches, it's our job to uncover whether or not star-bellied sneetches really are better than their plain-bellied cousins. The first step in answering this question will be to create our experimental data. In the following code snippet we create a dataframe object that contains a set of test scores for both star-bellied and plain-bellied sneetches. In [7]:importpandasaspddf=pd.DataFrame({'star':[1,1,1,1,1,1,1,1]+[0,0,0,0,0,0,0,0,0,0,0,0],'score':[84,72,57,46,63,76,99,91]+[81,69,74,61,56,87,69,65,66,44,62,69]})df Out[7]: score star 0 84 1 1 72 1 2 57 1 3 46 1 4 63 1 5 76 1 6 99 1 7 91 1 8 81 0 9 69 0 10 74 0 11 61 0 12 56 0 13 87 0 14 69 0 15 65 0 16 66 0 17 44 0 18 62 0 19 69 0If we then take a look at the average scores for each group of sneetches, we will see that there's a difference in scores of 6.6 between the two groups. So, on average, the star-bellied sneetches performed better on their tests than the plain-bellied sneetches. But, the real question is, is this a significant difference? In [8]:star_bellied_mean=df[df.star==1].score.mean()plain_bellied_mean=df[df.star==0].score.mean()print(""Star-bellied Sneetches Mean: %2.1f""%star_bellied_mean)print(""Plain-bellied Sneetches Mean: %2.1f""%plain_bellied_mean)print(""Difference: %2.1f""%(star_bellied_mean-plain_bellied_mean)) Star-bellied Sneetches Mean: 73.5 Plain-bellied Sneetches Mean: 66.9 Difference: 6.6 To determine if this is a signficant difference, we could perform a t-test on our data to compute a p-value, and then just make sure that the p-value is less than the target 0.05. Alternatively, we could use simulation instead. Unlike our first example, however, we don't have a generative function that we can use to create our probability distribution. So, how can we then use simulation to solve our problem? Well, we can run a bunch of simulations where we randomly shuffle the labels (i.e., star-bellied or plain-bellied) of each sneetch, recompute the difference between the means, and then determine if the proportion of simulations in which the difference was at least as extreme as 6.6 was less than the target 5%. If so, we can conclude that the difference we see is, in fact, one that doesn't occur strictly by chance very often and so the difference is a significant one. 
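A brief aside, and my addition rather than part of the original notebook: the classical t-test mentioned a few sentences back is a one-liner with scipy.stats, and it gives a useful baseline to compare the shuffling result against.

from scipy import stats
import pandas as pd

# Same sneetch scores as in the dataframe above.
df = pd.DataFrame({
    'star':  [1, 1, 1, 1, 1, 1, 1, 1] + [0] * 12,
    'score': [84, 72, 57, 46, 63, 76, 99, 91] +
             [81, 69, 74, 61, 56, 87, 69, 65, 66, 44, 62, 69]})

# Classical two-sample t-test on the same scores (Welch's variant).
t_stat, p_val = stats.ttest_ind(df[df.star == 1].score,
                                df[df.star == 0].score,
                                equal_var=False)
print('t = %.2f, p = %.2f' % (t_stat, p_val))

With that baseline noted, the shuffling simulation below tests the same question without leaning on any distributional assumptions.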
In other words, if the proportion of simulations that have a difference of 6.6 or greater is less than 5%, we can conclude that the labels really do matter, and so we can conclude that star-bellied sneetches are ""better"" than their plain-bellied counterparts. In [9]:
df['label'] = df['star']
num_simulations = 10000

differences = []
for i in range(num_simulations):
    np.random.shuffle(df['label'])
    star_bellied_mean = df[df.label == 1].score.mean()
    plain_bellied_mean = df[df.label == 0].score.mean()
    differences.append(star_bellied_mean - plain_bellied_mean)

Now that we've run our simulations, we can calculate our p-value, which is simply the proportion of simulations that resulted in a difference greater than or equal to 6.6. $$ p = \frac{N_{\geq 6.6}}{N_{total}} = \frac{1512}{10000} = 0.15 $$ In [10]:
p_value = sum(diff >= 6.6 for diff in differences) / num_simulations
print(""p-value: %2.2f"" % p_value)

p-value: 0.15 The following code plots the distribution of the differences we found by running the simulations above. We've also added an annotation that marks where the difference of 6.6 falls in the distribution along with its corresponding p-value. In [11]:
plt.hist(differences, bins=50, color=blue)
plt.xlabel('score difference')
plt.ylabel('number')
plt.plot((6.6, 6.6), (0, 700), color=red)
plt.annotate('%2.f%%' % (p_value * 100),
             xytext=(15, 350),
             xy=(6.6, 350),
             multialignment='right',
             va='center',
             color=red,
             size='large',
             arrowprops={'arrowstyle': '-|>', 'lw': 2,
                         'color': red, 'shrinkA': 10});

We can see from the histogram above---and from our simulated p-value, which was greater than 5%---that the difference that we are seeing between the populations can be explained by random chance, so we can effectively dismiss the difference as not statistically significant. In short, star-bellied sneetches are no better than the plain-bellied ones, at least not from a statistical point of view. For further discussion on this method of simulation, check out John Rauser's keynote talk ""Statistics Without the Agonizing Pain"" from Strata + Hadoop 2014. Jake mentions that he drew inspiration from it in his talk, and it is a really excellent talk as well; I wholeheartedly recommend it. BOOTSTRAPPING ¶ In this example, we'll be using the story of Yertle the Turtle to explore the bootstrapping recipe. As the story goes, in the land of Sala-ma-Sond, Yertle the Turtle was the king of the pond and he wanted to be the most powerful, highest turtle in the land. To achieve this goal, he would stack turtles as high as he could in order to stand upon their backs. As observers of this curious behavior, we've recorded the heights of 20 turtle towers and we've placed them in a dataframe in the following bit of code.
Bootstrap resampling is a method that simulates several random sample distributions by drawing samples from the current distribution with replacement, i.e., we can draw the same data point more than once. Luckily, pandas makes this super easy with its sample function. We simply need to make sure that we pass in True for the replace argument to sample from our dataset with replacement. In [13]:sample=df.sample(20,replace=True)display(sample)print(""Mean: %2.2f""%sample.heights.mean())print(""Standard Error: %2.2f""%(sample.heights.std()/np.sqrt(len(sample)))) heights 9 61 13 21 8 32 6 25 10 19 17 18 4 21 14 23 12 29 4 21 17 18 13 21 4 21 6 25 6 25 11 24 9 61 17 18 3 12 3 12Mean: 25.35 Standard Error: 2.93 More than likely the mean and standard error from our freshly drawn sample above didn't exactly match the one that we calculated using the classic method beforehand. But, if we continue to resample several thousand times and take a look at the average (mean) of all those sample means and their standard deviation, we should have something that very closely approximates the mean and standard error derived from using the classic method above. In [14]:xbar=[]foriinrange(10000):sample=df.sample(20,replace=True)xbar.append(sample.heights.mean())print(""Mean: %2.1f""%np.mean(xbar))print(""Standard Error: %2.1f""%np.std(xbar)) Mean: 28.8 Standard Error: 2.9 CROSS VALIDATION ¶ For the final example, we dive into the world of the Lorax. In the story of the Lorax, a faceless creature sales an item that (presumably) all creatures need called a Thneed. Our job as consultants to Onceler Industries is to project Thneed sales. But, before we can get started forecasting the sales of Thneeds, we'll first need some data. Lucky for you, I've already done the hard work of assembling that data in the code below by ""eyeballing"" the data in the scatter plot from the slides of the talk. So, it may not be exactly the same, but it should be close enough for our example analysis. In [15]:df=pd.DataFrame({'temp':[22,36,36,38,44,45,47,43,44,45,47,49,52,53,53,53,54,55,55,55,56,57,58,59,60,61,61.5,61.7,61.7,61.7,61.8,62,62,63.4,64.6,65,65.6,65.6,66.4,66.9,67,67,67.4,67.5,68,69,70,71,71,71.5,72,72,72,72.7,73,73,73,73.3,74,75,75,77,77,77,77.4,77.9,78,78,79,80,82,83,84,85,85,86,87,88,90,90,91,93,95,97,102,104],'sales':[660,433,475,492,302,345,337,479,456,440,423,269,331,197,283,351,470,252,278,350,253,253,343,280,200,194,188,171,204,266,275,171,282,218,226,187,184,192,167,136,149,168,218,298,199,268,235,157,196,203,148,157,213,173,145,184,226,204,250,102,176,97,138,226,35,190,221,95,211,110,150,152,37,76,56,51,27,82,100,123,145,51,156,99,147,54]}) Now that we have our sales data in a pandas dataframe, we can take a look to see if any trends show up. Plotting the data in a scatterplot, like the one below, reveals that a relationship does seem to exist between temperature and Thneed sales. In [16]:# Grab a reference to fig and axes object so we can reuse themfig,ax=plt.subplots()# Plot the Thneed sales dataax.scatter(df.temp,df.sales)ax.set_xlim(xmin=20,xmax=110)ax.set_ylim(ymin=0,ymax=700)ax.set_xlabel('temprature (F)')ax.set_ylabel('thneed sales (daily)'); We can see what looks like a relationship between the two variables temperature and sales, but how can we best model that relationship so we can accurately predict sales based on temperature? Well, one measure of a model's accuracy is the Root-Mean-Square Error (RMSE) . 
This metric represents the sample standard deviation between a set of predicted values (from our model) and the actual observed values. In [17]:defrmse(predictions,targets):returnnp.sqrt(((predictions-targets)**2).mean()) We can now use our rmse function to measure how well our models' accurately represent the Thneed sales dataset. And, in the next cell, we'll give it a try by creating two different models and seeing which one does a better job of fitting our sales data. In [18]:# 1D Polynomial Fitd1_model=np.poly1d(np.polyfit(df.temp,df.sales,1))d1_predictions=d1_model(range(111))ax.plot(range(111),d1_predictions,color=blue,alpha=0.7)# 2D Polynomial Fitd2_model=np.poly1d(np.polyfit(df.temp,df.sales,2))d2_predictions=d2_model(range(111))ax.plot(range(111),d2_predictions,color=red,alpha=0.5)ax.annotate('RMS error = %2.1f'%rmse(d1_model(df.temp),df.sales),xy=(75,650),fontsize=20,color=blue,backgroundcolor='w')ax.annotate('RMS error = %2.1f'%rmse(d2_model(df.temp),df.sales),xy=(75,580),fontsize=20,color=red,backgroundcolor='w')display(fig); In the figure above, we plotted our sales data along with the two models we created in the previous step. The first model (in blue) is a simple linear model, i.e., a first-degree polynomial . The second model (in red) is a second-degree polynomial, so rather than a straight line, we end up with a slight curve. We can see from the RMSE values in the figure above that the second-degree polynomial performed better than the simple linear model. Of course, the question you should now be asking is, is this the best possible model that we can find? To find out, let's take a look at the RMSE of a few more models to see if we can do any better. In [19]:rmses=[]fordeginrange(15):model=np.poly1d(np.polyfit(df.temp,df.sales,deg))predictions=model(df.temp)rmses.append(rmse(predictions,df.sales))plt.plot(range(15),rmses)plt.ylim(45,70)plt.xlabel('number of terms in fit')plt.ylabel('rms error')plt.annotate('$y = a + bx$',xytext=(14.2,70),xy=(1,rmses[1]),multialignment='right',va='center',arrowprops={'arrowstyle':'-|','lw':1,'shrinkA':10,'shrinkB':3})plt.annotate('$y = a + bx + cx^2$',xytext=(14.2,64),xy=(2,rmses[2]),multialignment='right',va='top',arrowprops={'arrowstyle':'-|','lw':1,'shrinkA':35,'shrinkB':3})plt.annotate('$y = a + bx + cx^2 + dx^3$',xytext=(14.2,58),xy=(3,rmses[3]),multialignment='right',va='top',arrowprops={'arrowstyle':'-|','lw':1,'shrinkA':12,'shrinkB':3}); We can see, from the plot above, that as we increase the number of terms (i.e., the degrees of freedom) in our model we decrease the RMSE, and this behavior can continue indefinitely, or until we have as many terms as we do data points, at which point we would be fitting the data perfectly. The problem with this approach though, is that as we increase the number of terms in our equation, we simply match the given dataset closer and closer, but what if our model were to see a data point that's not in our training dataset? As you can see in the plot below, the model that we've created, though it has a very low RMSE, it has so many terms that it matches our current dataset too closely. 
In [20]:# Remove everything but the datapointsax.lines.clear()ax.texts.clear()# Changing the y-axis limits to match the figure in the slidesax.set_ylim(0,1000)# 14 Dimensional Modelmodel=np.poly1d(np.polyfit(df.temp,df.sales,14))ax.plot(range(20,110),model(range(20,110)),color=sns.xkcd_rgb['sky blue'])display(fig) The problem with fitting the data too closely, is that our model is so finely tuned to our specific dataset, that if we were to use it to predict future sales, it would most likely fail to get very close to the actual value. This phenomenon of too closely modeling the training dataset is well known amongst machine learning practitioners as overfitting and one way that we can avoid it is to use cross-validation . Cross-validation avoids overfitting by splitting the training dataset into several subsets and using each one to train and test multiple models. Then, the RMSE's of each of those models are averaged to give a more likely estimate of how a model of that type would perform on unseen data. So, let's give it a try by splitting our data into two groups and randomly assigning data points into each one. In [21]:df_a=df.sample(n=len(df)/2)df_b=df.drop(df_a.index) We can get a look at the data points assigned to each subset by plotting each one as a different color. In [22]:plt.scatter(df_a.temp,df_a.sales,color='red')plt.scatter(df_b.temp,df_b.sales,color='blue')plt.xlim(0,110)plt.ylim(0,700)plt.xlabel('temprature (F)')plt.ylabel('thneed sales (daily)'); Then, we'll find the best model for each subset of data. In this particular example, we'll fit a second-degree polynomial to each subset and plot both below. In [23]:# Create a 2-degree model for each subset of datam1=np.poly1d(np.polyfit(df_a.temp,df_a.sales,2))m2=np.poly1d(np.polyfit(df_b.temp,df_b.sales,2))fig,(ax1,ax2)=plt.subplots(nrows=1,ncols=2,sharex=False,sharey=True,figsize=(12,5))x_min,x_max=20,110y_min,y_max=0,700x=range(x_min,x_max+1)# Plot the df_a groupax1.scatter(df_a.temp,df_a.sales,color='red')ax1.set_xlim(xmin=x_min,xmax=x_max)ax1.set_ylim(ymin=y_min,ymax=y_max)ax1.set_xlabel('temprature (F)')ax1.set_ylabel('thneed sales (daily)')ax1.plot(x,m1(x),color=sns.xkcd_rgb['sky blue'],alpha=0.7)# Plot the df_b groupax2.scatter(df_b.temp,df_b.sales,color='blue')ax2.set_xlim(xmin=x_min,xmax=x_max)ax2.set_ylim(ymin=y_min,ymax=y_max)ax2.set_xlabel('temprature (F)')ax2.plot(x,m2(x),color=sns.xkcd_rgb['rose'],alpha=0.5); Finally, we'll compare models across subsets by calculating the RMSE for each model using the training set for the other model. This will give us two RMSE scores which we'll then average to get a more accurate estimate of how well a second-degree polynomial will perform on any unseen data. In [24]:print(""RMS = %2.1f""%rmse(m1(df_a.temp),df_a.sales))print(""RMS = %2.1f""%rmse(m2(df_b.temp),df_b.sales))print(""RMS estimate = %2.1f""%np.mean([rmse(m1(df_a.temp),df_a.sales),rmse(m2(df_b.temp),df_b.sales)])) RMS = 55.3 RMS = 49.4 RMS estimate = 52.4 Then, we simply repeat this process for as long as we so desire. The following code repeats the process described above for polynomials up to 14 degrees and plots the average RMSE for each one against the non-cross-validated RMSE's that we calculated earlier. 
In [25]:rmses=[]cross_validated_rmses=[]fordeginrange(15):# df_a the model on the whole dataset and calculate its# RMSE on the same set of datamodel=np.poly1d(np.polyfit(df.temp,df.sales,deg))predictions=model(df.temp)rmses.append(rmse(predictions,df.sales))# Use cross-validation to create the model and df_a itm1=np.poly1d(np.polyfit(df_a.temp,df_a.sales,deg))m2=np.poly1d(np.polyfit(df_b.temp,df_b.sales,deg))p1=m1(df_b.temp)p2=m2(df_a.temp)cross_validated_rmses.append(np.mean([rmse(p1,df_b.sales),rmse(p2,df_a.sales)]))plt.plot(range(15),rmses,color=blue,label='RMS')plt.plot(range(15),cross_validated_rmses,color=red,label='cross validated RMS')plt.ylim(45,70)plt.xlabel('number of terms in fit')plt.ylabel('rms error')plt.legend(frameon=True)plt.annotate('Best model minimizes the\ncross-validated error.',xytext=(7,60),xy=(2,cross_validated_rmses[2]),multialignment='center',va='top',color='blue',size=25,backgroundcolor='w',arrowprops={'arrowstyle':'-|','lw':3,'shrinkA':12,'shrinkB':3,'color':'blue'}); According to the graph above, going from a 1-degree to a 2-degree polynomial gives us quite a large improvement overall. But, unlike the RMSE that we calculated against the training set, when using cross-validation we can see that adding more degrees of freedom to our equation quickly reduces the effectiveness of the model against unseen data. This is overfitting in action! In fact, from the looks of the graph above, it would seem that a second-degree polynomial is actually our best bet for this particular dataset. 2-FOLD CROSS-VALIDATION ¶ Several different methods for performing cross-validation exist, the one we've just seen is called 2-fold cross-validation since the data is split into two subsets. Another close relative is a method called $k$-fold cross-validation . It differs slightly in that the original dataset is divided into $k$ subsets (instead of just 2), one of which is reserved strictly for testing and the other $k - 1$ subsets are used for training models. This is just one example of an alternate cross-validation method, but more do exist and each one has advantages and drawbacks that you'll need to consider when deciding which method to use. CONCLUSION ¶ Ok, so that covers nearly everything that Jake covered in his talk. The end of the talk contains a short overview of some important areas that he didn't have time to cover, and there's a nice set of Q&A at the end, but I'll simply direct you to the video for those parts of the talk. Hopefully, this article/notebook has been helpful to anyone working their way through Jake's talk, and if for some reason, you've read through this entire article and haven't watched the video of the talk yet, I encourage you to take 40 minutes out of your day and go watch it now ---it really is a fantastic talk! FOUND AN ERROR WITH MY ANALYSIS OR A BUG IN MY CODE? Everything on this site is avaliable on GitHub. Head on over and submit an issue. You can also message me directly on Twitter . All work is available on GitHub . Copyright © Christopher Roach, 2017 . Site powered by pelican , theme crafted by Chris Albon ( GitHub ).","Musings on data science and software engineering (and at times, economics as well)",Statistics for Hackers,Live,125 321,"Enterprise Pricing Articles Sign in Free 30-Day TrialRETHINKDB JOINERY Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published Jul 21, 2016One of the great things about RethinkDB is that it has join functionality baked in as part of the query engine. 
This is, compared to MongoDB, where the ""lookup"" function has been added to the aggregation framework, a much more useful capability which gives a lot more flexibility in designing your data models. That said, there are some pretty important things to bear in mind when you start joining in RethinkDB, and at the top of that list is... THE FASTEST JOIN IS EQJOIN If you are associating documents in one RethinkDB table with documents in another, then the most efficient way is for the document on the left hand side of the association to refer, by id, to the document on the right hand side. That's because the id field for any document is indexed by default, so it's faster to look up. Looking up the value also, by definition, means it's equal. That's what eqJoin (or eq_join if you are working in Python) does, an ""equals join"". Let's work with a solid example; we're going to be using JavaScript and Node.js 6 for these examples. In the Github repository for this article is a node program called populate.js. It assumes you've created a database called spystuff and two tables, agents and orgs . When you run it, it'll insert organization records that look like this: { ""org"": ""MI6"", ""alignment"": { ""country"": ""UK"", ""side"": ""west"" } } into the orgs table, get all the ids of those orgs and then update the agents data, which looks like this: { ""name"": ""James Bond"", ""org"": ""MI6"", ""skill"": [""assassination""] } to include the appropriate organization id numbers (as ""org_id""), remove the ""org"" field and insert the result into the agents table. Now we're set up with a problem that joining tables is made for. We want to get all the agents' data with their organization data in the same document. This is where we'll use the eqJoin function. The orgs table is already indexed by id and if we look at the command at the core of eqjoin.js we find this: r.table(""agents"").eqJoin(""org_id"", r.table(""orgs"")) This starts with the agents table and applies an eqJoin to it, telling it to use the agents' org_id field and look it up in the orgs table. It'll default to using the table's primary key. That command gives records back like this: { ""left"": { ""id"": ""58662b62-6a09-422b-88ad-c4acaabaa29b"", ""name"": ""John Drake"", ""org_id"": ""192711a9-f73b-40f4-886e-07b9a188c47e"", ""skill"": [ ""investigation"" ] }, ""right"": { ""alignment"": { ""country"": ""UK"", ""side"": ""west"" }, ""id"": ""192711a9-f73b-40f4-886e-07b9a188c47e"", ""org"": ""M9"" } } Yes, as it comes out of the join command, the left and the right side are still kept separate. To take care of this, RethinkDB recommends the zip function. If we add a .zip() to our query, like so: r.table(""agents"").eqJoin(""org_id"", r.table(""orgs"")).zip() We get this: { ""alignment"": { ""country"": ""UK"", ""side"": ""west"" }, ""id"": ""192711a9-f73b-40f4-886e-07b9a188c47e"", ""name"": ""John Drake"", ""org"": ""M9"", ""org_id"": ""192711a9-f73b-40f4-886e-07b9a188c47e"", ""skill"": [ ""investigation"" ] } Which looks great until you look a little closer and notice that the id from the organization document has wiped out the id belonging to the agent document.
Not a problem, as there's the without function too and that can get rid of that right hand side id field with .without({""right"": {""id"":true}}) like so: r.table(""agents"").eqJoin(""org_id"", r.table(""orgs"")).without({""right"": {""id"":true}}).zip() and now we get: { ""alignment"": { ""country"": ""UK"", ""side"": ""west"" }, ""id"": ""58662b62-6a09-422b-88ad-c4acaabaa29b"", ""name"": ""John Drake"", ""org"": ""M9"", ""org_id"": ""192711a9-f73b-40f4-886e-07b9a188c47e"", ""skill"": [ ""investigation"" ] } Even though we were deleting the id field on the right hand side, we've retained the organization id field which we were joining on. The RethinkDB documentation on joins shows a couple of other ways you could mitigate this overwriting, but this is the simplest way for simple eqJoins . INDEXES AND EQJOIN The eqJoin function is actually very simple at it's core. It moves through the left hand table using the specified field and simply looks up the value in the index it's been given on the right hand side. By default, that's the id field as that is the primary key and index on the right hand side. But it doesn't have to be that index. You can point at any index that exists for the right hand side and as long as there are values there that match up with the left hand side values, you'll get results for that query. Let's add a table of assets to our spystuff database - you'll find the code for this in populate2.js in the repository . It adds records like this: { ""type"": ""Black Helicopter"", ""use"": [ ""stealth"", ""investigation"" ], ""designer"": ""UN"" } First, let's unite these assets with their organizations. Let's assume that if a country designed an asset, then organizations in that country can use that asset. We'll need to create an index on the designer field of the assets which we can use with eqjoin . We'll do that in the populate2.js file with: r.table(""assets"").indexCreate(""designer"") Now we can do our join - you'll find it in eqjoinindex.js . r.table(""orgs"").eqJoin( r.row(""alignment"")(""country""), r.table(""assets""),{ index:""designer""} ).without({""right"": { ""id"": true }}) .zip() So, taking that step by step, we start with the orgs table and we apply an eqJoin to it. The field we want to join on isn't a top level field so we pass r.row(""alignment"")(""country"") so we can access it. We then tell eqJoin we want to join with the assets table. Here's the new bit; in the last options parameter, we pass a { index:""designer"" } to tell eqJoin to use that index to lookup on, so we're now joining the alignment.country of organizations with the designer of assets which gets us records like this: { ""alignment"": { ""country"": ""USA"", ""side"": ""west"" }, ""designer"": ""USA"", ""id"": ""938201f0-6a24-4f0a-91ee-cc1751df23a4"", ""org"": ""CIA"", ""type"": ""Laser Pen"", ""use"": [ ""management"", ""combat"" ] } Now we can see the CIA has access to laser pens. It's also quite a good example of why you may not want to zip records at all. Let's show another aspect of this secondary index joining; multi-indexes. Those are indexes where the field being indexed is an array of values; when the indexer is told to index, it indexes the record for each one of these values. So how can we use that? Say we want to match our agent's primary skills with the assets they can use. We'll want to index that ""use"" field first. The example code does just that with r.table(""assets"").indexCreate(""use"",{ multi: true } ) . 
The multi:true part lets the index work with the array as discrete values. With that index in place, let's make a join query: r.table('agents').eqJoin(r.row(""skill"")(0),r.table(""assets""),{ index:""use"" }).zip() There are a whole lot of things happening here. The r.row(""skill"")(0) is referring to the first value in the array of values in the ""skill"" field. This is closer to being a function than a reference, and it is worth noting that eqJoin can take a function to create the value to match with. We point at the ""assets"" table as the right hand side and we tell it to use the index we created with { index:""use"" } . There is one other option, by the way: ""ordered"", which when set to true will sort according to the left hand side's input - we're just not using it here. Anyway, the effect of this is that when the first skill of an agent is present in the ""use"" array of an asset document, the two documents will be joined; we've added a zip to merge the fields and get something like: { ""designer"": ""Global"", ""id"": ""cd8aefc6-5442-499d-84bc-9fb85172b6f8"", ""name"": ""Chuck Bartowski"", ""org_id"": ""11a662d3-0477-4c96-a0d4-3ceebc0c29a4"", ""skill"": [ ""investigation"", ""stealth"" ], ""type"": ""Microdrone"", ""use"": [ ""investigation"", ""stealth"", ""assassination"" ] } We could do another eqJoin against the orgs table - three way joins are easy enough - but that demonstrates the flexibility of the eqJoin function. WHAT OF INNER_JOIN AND OUTER_JOIN? There are other join functions - innerJoin and outerJoin - but they are slower and less efficient than eqJoin . Both use a function which evaluates to true or false. That means, though, that there are no index lookups on the right hand side as there are with eqJoin - for each row on the left hand side, the right hand side is scanned and the function evaluated. So it's slower. On the up side, if it's a join you want to do that isn't based on a simple equality of fields, these are the functions you are looking for. We could do something similar to the previous eqJoin command, without the index, like so: r.table('agents').innerJoin(r.table('assets'), (agrow, asrow) => { return agrow('skill').setIntersection(asrow('use')).count().ge(2) }) What we are doing here is a set intersection between the agent's skill array and the asset's use array, returning true if two or more items are in the intersection - which turns out to match one agent with two assets. Powerful, but you'll take a hit in terms of performance. Remember these tiny tables we're using are living in the cache, probably the processor cache even - when scaled up, you could really pay the price in performance. JOIN POWER So we've looked at RethinkDB's join functions and as you can see they deliver what we typically need from a join function: a simple binding between records based on the equality of fields. It's simple, quick and clear. Dj Walker-Morgan is Compose's resident Content Curator, and has been both a developer and writer since Apples came in II flavors and Commodores had Pets. Love this article? Head over to Dj Walker-Morgan's author page and keep reading. 
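For anyone following along in Python rather than JavaScript - the article already notes that eqJoin is spelled eq_join in the Python driver - here is a rough, unofficial sketch of the same two joins using the rethinkdb Python package. The connection details are placeholders, and the spystuff database, tables and indexes are assumed to have been created by the populate scripts above.

# Rough Python-driver sketch of the joins above; connection details are placeholders.
import rethinkdb as r

conn = r.connect(host='localhost', port=28015, db='spystuff')

# eqJoin / without / zip, as in eqjoin.js
agents_with_orgs = (r.table('agents')
                    .eq_join('org_id', r.table('orgs'))
                    .without({'right': {'id': True}})
                    .zip()
                    .run(conn))

# innerJoin with a predicate: join an agent to any asset that shares
# at least two values between the agent's 'skill' and the asset's 'use'
agents_with_assets = (r.table('agents')
                      .inner_join(r.table('assets'),
                                  lambda agent, asset: agent['skill']
                                      .set_intersection(asset['use'])
                                      .count()
                                      .ge(2))
                      .zip()
                      .run(conn))

for doc in agents_with_orgs:
    print(doc['name'], doc.get('org'))

The shape of the results is the same as in the JavaScript examples, so the same zip and without considerations about clashing id fields apply.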
Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Database as a Service Customer Stories Support System Status Support Enhanced Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ Enterprise Add-ons Deployments AWS DigitalOcean SoftLayer© 2016 Compose",Learn about JOINs in the RethinkDB document database.,RethinkDB Joinery,Live,126 322,"The couchdb package is a Meteor package available onAtmosphere. The package is a full stack databasedriver that provides functionality to work with Apache CouchDB in Meteor.* an efficient Livequery implementation providing real-timeupdates from the database by consuming the CouchDB _changes feed* Distributed Data Protocol (DDP) RPC end-points for updating the data from clients connected over the wire* Serialization and deserialization of updates to the DDP formatThis Readme covers the followingAdd this package to your Meteor app:meteor add cloudant:couchdbSince Apache CouchDB is not shipped with Meteor or this package, you need to have a running CouchDB/Cloudant server and a url to connect to it.Note: The JSON query syntax used is 'Cloudant Query', initially developed by Cloudant and contributed back to Apache CouchDB version 2.0. Pre-built binaries of Apache CouchDB 2.0 are not yet available, so the easiest way to use this module is with Cloudant DBaas or LocalTo configure the Apache CouchDB/Cloudant server connection information, pass its url as the COUCHDB_URLenvironment variable to the Meteor server process.$export COUCHDB_URL=https://username:password@username.cloudant.comJust like Mongo.Collection, you will work with CouchDB.Database for CouchDB data.You can instantiate a CouchDB.Database on both client and on the server.var Tasks = new CouchDB.Database(""tasks"");The database wraps the Cloudant Query commands. If a callback is passed then the commands execute asynchronously. If no callback is passed, on the server, the call is executed synchronously (technically this uses fibers and only appears to be synchronous, so it does not block the event-loop). If you're on the client and don't pass a callback, the call executes asynchronously and you won't be notified of the result.One can publish a cursor on the server and the client subscribe to it.if (Meteor.isServer) {// This code only runs on the serverMeteor.publish(""tasks"", function () {return Tasks.find();if (Meteor.isClient) {// This code only runs on the clientMeteor.subscribe(""tasks"");This way data will be automatically synchronized to all subscribed clients.Latency compensation works with all supported commands used either at the client or client's simulations.Once you remove the insecure package, you can allow/deny database modifications from the client//make sure no extra properties besides postContent are included in the insert operationTasks.allow({insert: function (userId, doc) {return _.without(_.keys(doc), 'postContent').length === 0;Apache CouchDB stores data in Databases. To get started, declare a database with new CouchDB.Database.new CouchDB.Database(name, [options])Constructor for a DatabaseArgumentsname StringThe name of the database. If null, creates an unmanaged (unsynchronized) local database.Optionsconnection ObjectThe server connection that will manage this database. Uses the default connection if not specified. Pass the return value of calling DDP.connect to specify a different server. Pass null to specify no connection. 
Unmanaged (name is null) databases cannot specify a connection.idGeneration StringThe method of generating the _id fields of new documents in this database. Possible values:'STRING': random stringsThe default id generation technique is 'STRING'.Calling this function sets up a database (a storage space for records, or ""documents"") that can be used to store a particular type of information that matters to your application. Each document is a JSON object. It includes an _id property whose value is unique in the database, which Meteor will set when you first create the document.// common code on client and server declares a DDP-managed couchdb// database.Chatrooms = new CouchDB.Database(""chatrooms"");Messages = new CouchDB.Database(""messages"");The function returns an object with methods to insert documents in the database, update their properties, and remove them, and to find the documents in the database that match arbitrary criteria. The way these methods work is compatible with the popular CouchDB JSON Query syntax. The same database API works on both the client and the server (see below).// return array of my messagesvar myMessages = Messages.find({userId: Session.get('myUserId')}).fetch();// create a new messagevar id = Messages.insert({text: ""Hello, world!""});// mark my first message as ""important""Messages.update({_id: id, text: 'Hello, world!', important: true });If you pass a name when you create the database, then you are declaring a persistent database — one that is stored on the server and seen by all users. Client code and server code can both access the same database using the same API.Specifically, when you pass a name, here's what happens:On the server (if you do not specify a connection), a database with that name is created on the backend CouchDB server. When you call methods on that database on the server, they translate directly into normal CouchDB operations (after checking that they match your access control rules).On the client (and on the server if you specify a connection), Meteor's Minimongo is reused i.e. Minimongo instance is created. Queries (find) on these databases are served directly out of this cache, without talking to the server.When you write to the database on the client (insert, update, remove), the command is executed locally immediately, and, simultaneously, it's sent to the server and executed there too. This happens via stubs, because writes are implemented as methods.When, on the server, you write to a database which has a specified connection to another server, it sends the corresponding method to the other server and receives the changed values back from it over DDP. Unlike on the client, it does not execute the write locally first.If you pass null as the name, then you're creating a local database. It's not synchronized anywhere; it's just a local scratchpad that supports find, insert, update, and remove operations. (On both the client and the server, this scratchpad is implemented using Minimongo.)Find the documents in a database that match the selector.Argumentsselector : Selector specifier, or StringA query describing the documents to find.optionssort Sort specifierSort orderskip NumberNumber of results to skip at the beginninglimit NumberMaximum number of results to returnfields : Field specifierfields to returnfind returns a cursor. It does not immediately access the database or return documents. 
Cursors provide fetch to return all matching documents, map and forEach to iterate over all matching documents, and observe and observeChanges to register callbacks when the set of matching documents changes.Cursors are not query snapshots. Cursors are a reactive data source. Any change to the database that changes the documents in a cursor will trigger a recomputation.Finds the first document that matches the selector, as ordered by sort and skip options.Argumentsselector : Selector specifier, or StringA query describing the documents to find.optionssort Sort SpecifierSort orderskip NumberNumber of results to skip at the beginninglimit NumberMaximum number of results to returnfields : Field specifierfields to returnInsert a document in the database. Returns its unique _id.Argumentsdoc ObjectThe document to insert. May not yet have an _id attribute, in which case Meteor will generate one for you.callback FunctionOptional. If present, called with an error object as the first argument and, if no error, the _id as the second.Add a document to the database. A document is just an object, and its fields can contain any combination of compatible datatypes (arrays, objects, numbers, strings, null, true, and false).insert will generate a unique ID for the object you pass, insert it in the database, and return the ID.Replace a document in the database. Returns 1 if document updated, 0 if not.Argumentsdoc JSON document with _id fieldthe _id field in this doc specifies which document in the database is to be replaced by this document's content.optionsupsert BooleanTrue to insert a document if no matching document is found.callback FunctionOptional. If present, called with an error object as the first argument and, if no error, returns 1 as the second.Replace a document that matches the _id field. This is done on the Apache CouchDB Server via a updateHandler ignoring the _rev field (Hence behaviour is same as last-writer-wins)Returns 1 from the update call if successful and you don't pass a callback.You can use update to perform a upsert by setting the upsert option to true. You can also use the upsert method to perform an upsert that returns the _id of the document that was inserted (if there was one)Replace a document in the database, or insert one if no matching document were found. Returns an object with keys numberAffected (1 if successful, otherwise 0) and insertedId (the unique _id of the document that was inserted, if any).Argumentsdoc JSON document with _id fieldthe _id field in this doc specifies which document in the database is to be replaced by this document's content if exists. If doesnt exist document is insertedcallback FunctionOptional. If present, called with an error object as the first argument and, if no error, returns 1 as the second.Replace a document that matches the _id of the document, or insert a document if no document matched the _id. This is done on the Apache CouchDB Server via a updateHandler ignoring the _rev field (hence behaviour is same as last-writer-wins). upsert is the same as calling update with the upsert option set to true, except that the return value of upsert is an object that contain the keys numberAffected and insertedId. (update returns only 1 if successful or 0 if not)Remove a document from the database.Argumentsid_id value of the document to be removedcallback FunctionOptional. 
If present, called with an error object as the first argument and, if no error, returns 1 as the second.Delete the document whose _id matches the specified value them from the database. This is done on the Apache CouchDB Server via a updateHandler ignoring the _rev field1 will be returned when successful otherwise 0, if you don't pass a callback.Allow users to write directly to this database from client code, subject to limitations you define.optionsinsert, update, remove FunctionFunctions that look at a proposed modification to the database and return true if it should be allowed.fetch Array of StringsOptional performance enhancement. Limits the fields that will be fetched from the database for inspection by your update and remove functions.When a client calls insert, update, or remove on a database, the database's allow and deny callbacks are called on the server to determine if the write should be allowed. If at least one allow callback allows the write, and no deny callbacks deny the write, then the write is allowed to proceed.These checks are run only when a client tries to write to the database directly, for example by calling update from inside an event handler. Server code is trusted and isn't subject to allow and deny restrictions. That includes methods that are called with Meteor.call — they are expected to do their own access checking rather than relying on allow and deny.You can call allow as many times as you like, and each call can include any combination of insert, update, and remove functions. The functions should return true if they think the operation should be allowed. Otherwise they should return false, or nothing at all (undefined). In that case Meteor will continue searching through any other allow rules on the database.The available callbacks are:* insert(userId, doc)The user userId wants to insert the document doc into the database. Return true if this should be allowed. doc will contain the _id field if one was explicitly set by the client. You can use this to prevent users from specifying arbitrary _id fields.* update(userId, doc, modifiedDoc) The user userId wants to update a document doc. (doc is the current version of the document from the database, without the proposed update.) Return true to permit the change. modifiedDoc is the doc submitted by the user.* remove(userId, doc) The user userId wants to remove doc from the database. Return true to permit this.When calling update or remove Meteor will by default fetch the entire document doc from the database. If you have large documents you may wish to fetch only the fields that are actually used by your functions. Accomplish this by setting fetch to an array of field names to retrieve.If you never set up any allow rules on a database then all client writes to the database will be denied, and it will only be possible to write to the database from server-side code. In this case you will have to create a method for each possible write that clients are allowed to do. 
You'll then call these methods with Meteor.call rather than having the clients call insert, update, and remove directly on the database.Override allow rules.optionsinsert, update, remove FunctionFunctions that look at a proposed modification to the database and return true if it should be denied, even if an allow rule says otherwise.This works just like allow, except it lets you make sure that certain writes are definitely denied, even if there is an allow rule that says that they should be permitted.When a client tries to write to a database, the Meteor server first checks the database's deny rules. If none of them return true then it checks the database's allow rules. Meteor allows the write only if no deny rules return true and at least one allow rule returns true.To create a cursor, use database.find. To access the documents in a cursor, use forEach, map, or fetch.Call callback once for each matching document, sequentially and synchronously.Argumentscallback FunctionFunction to call. It will be called with three arguments: the document, a 0-based index, and cursor itself.thisArg AnyAn object which will be the value of this inside callback.When called from a reactive computation, forEach registers dependencies on the matching documents.Map callback over all matching documents. Returns an Array.Argumentscallback FunctionFunction to call. It will be called with three arguments: the document, a 0-based index, and cursor itself.thisArg AnyAn object which will be the value of this inside callback.When called from a reactive computation, map registers dependencies on the matching documents.On the server, if callback yields, other calls to callback may occur while the first call is waiting. If strict sequential execution is necessary, use forEach instead.Return all matching documents as an Array.When called from a reactive computation, fetch registers dependencies on the matching documents.Returns the number of documents that match a query.Unlike the other functions, count registers a dependency only on the number of matching documents. (Updates that just change or reorder the documents in the result set will not trigger a recomputation.)Watch a query. Receive callbacks as the result set changes.Argumentscallbacks ObjectFunctions to call to deliver the result set as it changesThis follow same behaviour of mongo-livedata driverWatch a query. Receive callbacks as the result set changes. Only the differences between the old and new documents are passed to the callbacks.Argumentscallbacks ObjectFunctions to call to deliver the result set as it changesThis follow same behaviour of mongo-livedata driverThe simplest selectors are just a string. These selectors match the document with that value in its _id field.A slightly more complex form of selector is an object containing a set of keys that must match in a document:// Matches all documents where the name and cognomen are as given{name: ""Rhialto"", cognomen: ""the Marvelous""}// Matches every documentBut they can also contain more complicated tests:// Matches documents where age is greater than 18{age: {$gt: 18}}Sorts maybe specified using the Cloudant sort syntax//Example[{""Actor_name"": ""asc""}, {""Movie_runtime"": ""desc""}]JSON array following the field syntax, described below. This parameter lets you specify which fields of an object should be returned. 
If it is omitted, the entire object is returned.// Example include only Actor_name, Movie_year and _id[""Actor_name"", ""Movie_year"", ""_id""]",Meteor database driver for CouchDB and Cloudant,cloudant/meteor-couchdb,Live,127 325,"Inside every Cloudant account is a world-class search engine based on Apache Lucene™. We've recently added some powerful features to search, and -- since we pride ourselves on making the difficult seem easy -- we've made it simple to use.In this post, I'll take you through a demo of Cloudant's new faceted search capabilities. You don't have to be a search expert. We'll take this step-by-step.Create a free account and have fun with Cloudant faceted searchAt the most basic level, a text search engine does two things:Finds results (docs, Web pages, emails, etc.) that contain the searched-for termDisplays those results in order of relevanceAs my high-school physics teacher liked to say, ""You don't need to know how to build a telephone to use a telephone."" How engines find and rank these results is outside the scope of this post. If you're interested in the ""building the telephone"" details you can find plenty of references [1], [2], [3], [4].Before setting up any search indexes, we need a rich dataset with a large number of documents similar in format but different in content to test. This allows us to take advantage of both text search and Cloudant's new faceting functionality.Cloudant has a number of open data sets in the Cloudant/examples directory including the public dataset on government lobbyists. There are a number of interesting fields worth searching. This database is world-readable, so you too can replicate it into your account to try it out.I replicated this database to my account by adding the following JSON doc to my _replicator DB (simplified instructions follow; if this isn't immediately obvious, don't worry):Note: we have to set the use_checkpoints field to false for this replication to work.Don't have a _replicator database? The first step is to create one via the Cloudant dashboard:_replicator is a special database that contains your replication jobs. Now that you have one in place, pull up the command line terminal. We'll use curl to send commands to Cloudant. The command below will POST the necessary JSON to _replicator (substitute your account permissions where appropriate):$ curl -X POST 'https://Now we have the data set in our own Cloudant database -- a large number of docs of varying size and content, all of which contain information about registered lobbyists in D.C. and their activities.Time to get my first index cooking!Let's start out very simply and index the ""Type"" field of each document. Before writing our index function, let's look at the structure of an example JSON document in the database to know what we're working with. (Keep in mind that each document has its own self-defining structure, which can vary from doc to doc.)Now, look at a couple references on Cloudant Search and defining indexes in design docs. You'll see that setting up an index like this is fairly simple, and simpler still using the indexing functionality in the new dash. We can create a new search index called ""type"" in the lobbyist DB using this JavaScript function:Functions that define indexes need to be stored in special JSON documents called design documents. Start by going into your lobbyists database via the dashboard and creating a new search index. Here, you can choose a design document to save the index function to, or create a new one. 
You'll want to do the latter and name it ""SearchTest"". Name the index ""type"" and enter the function provided above. Here's what yours should look like after you create the index and go back in to edit it:We’ve chosen the ""Standard"" analyzer from the dropdown list. Use it for this tutorial, and visit our For Developers site for a list of other generic and language-specific analyzers included in Cloudant Search, if you're curious. So, after hitting save we have a design doc that looks like this:Note: Cloudant will generate this JSON for you. You won't need to copy, paste, and modify it. To find it, navigate to ""All design docs"" in your Cloudant dashboard and edit the SearchTest design doc to see the code.Like all documents in Cloudant, this one is JSON. Don't worry about the various metadata pieces, just focus on the index definition. Here it is again, with the newline characters (\n) parsed for readability:The index definition is central to getting search working in Cloudant, so let's take it piece-by-piece.First, we declare a function that takes a single argument: a JSON document. We set up a simple if statement to make sure that only docs with the “Type” field actually get sent to the indexer.Then we call index(""type"", doc.Type), which takes at least two arguments:""indexName"", which is the index name you'll have to specify in your search query parameters in Cloudant. Here, we pass in the string ""type"".The second argument, which is the part of a JSON doc that you want to index. Generically, you can think of it as doc.key. In the case of our index definition, doc.key is the ""Type"" field in our JSON documents, hence the argument doc.Type. IMPORTANT! Only strings and numbers can be indexed in Cloudant full-text search indexes. Nested fields, objects, etc. cannot be indexed as full text; however, secondary database indexes in Cloudant can handle these structures.Save the new design doc, and Cloudant goes to work indexing every JSON document in the database. After a brief time, the index is ready for querying. (Give it a few minutes for the 1.2 GB in our example data set.) Let's search for all ""THIRD QUARTER"" reports, which looks like:Note: You can also POST your query in JSON form to the _search API endpoint using the following syntax:$ curl -u ""The response yields the total number of rows and the document IDs of the first 25 results. (The ""bookmark"" field is a value your application can use to paginate the rest of the results.) Here's some of the output:{ ""rows"": [ { ""fields"": {}, ""order"": [ 0.19044028222560883, 0 ], ""id"": ""315e4a1d10a025b62de23cd7c725bca4"" }, { ""fields"": {}, ""order"": [ 0.19044028222560883, 1 ], ""id"": ""315e4a1d10a025b62de23cd7c7249988"" }, ... { ""fields"": {}, ""order"": [ 0.19044028222560883, 55 ], ""id"": ""315e4a1d10a025b62de23cd7c74d535a"" } ], ""bookmark"": ""g2wAAAABaANkACFkYmNvcmVAZGI2Lm1vb25zaGluZS5jbG91ZGFudC5uZXRsAAAAAmEAYj____9qaAJGP8hgWOAAAABhN2o"", ""total_rows"": 83146 }And huzzah! We've done our first search query in Cloudant!Ok, now let's dig into something a little more interesting.Many of the documents in the lobbyist DB have a field called ""Amount"", indicating the amount of money the lobbyist(s) in question spent on their efforts over some time period. This is certainly more interesting than the third quarter reports we queried for earlier! 
And with Cloudant's new range facets (released to multitenant customers on April 10), we can easily find the number of transactions of a certain size.Note: Range facets can only be used on numbers, and count facets can only be used on strings. Remember this, and you'll be fine.The index function needed for this is slightly more complicated than the one above, but not much! Again, fire up your Cloudant dashboard and create a new search index as follows (I'm going to save mine as ""amountSearch""):Make sure to associate the new index with your SearchTest design doc. Here's what that looks like in the Cloudant dashboard:The new design doc for SearchTest now looks like this:The initial index we created called ""type"" is still there, but now there is a new index function called ""amountSearch"". I converted doc.Amount to an integer to ensure that I can use Cloudant's range faceting functionality (in case some documents store the value associated with the ""Amount"" field as a string and others as an integer, as is the case in our lobbyists database). Finally, I added a third argument to the index function, {""facet"":true}, to enable faceting.Now let's see who is paying whom, and how much. We can query the amounts similarly to querying the types of reports stored in the database:This query will return the first 25 docs in which the ""Amount"" field is exactly $500,000. I specified the field name in our query parameters (?q=amount:500000) because our index function explicitly names this field, passed into it as its first argument.Now let's get fancy and try out range facets. Say we want to split all records into ""cheap"" lobbyists (less than $25,000) and ""expensive"" lobbyists (greater than $25,000.) This can be done using the range query parameter:Note: Be aware that the URL below must be properly escaped before it will work via curl. Empty spaces should be replaced by ""%20"". You can tell curl to parse square brackets and curly brackets by disabling globbing via the ""-g"" flag. (See this handy post from our friend Glynn Bird (@glynn_bird) for more.) OK. We'll write this one for you, minus your credentials, of course ;-)$ curl -g -u ""Back to our example, the first query q=\*:* simply ensures that every doc in the DB is returned. Note that the range definitions use inclusive/exclusive syntax to define the range boundaries. Inclusive range queries are denoted by square brackets. Exclusive range queries are denoted by curly brackets.The response includes IDs for the first 25 hits plus the following output:Now we can see a fairly even split between cheap and expensive lobbyists.Note: The results of your output will vary based on how quickly you're progressing through these examples. It normally takes a few minutes to replicate the 1.2 GB lobbyists database and then to build the search indexes. Your query could take a couple minutes to return if Cloudant is still building an index, unless you append stale=ok to your query parameters. The stale=ok parameter indicates that your application would rather have low-latency responses than a completely up-to-date index. Here's how that would look:curl -g -u ""Before moving on, I should point out something you may have missed upon first reading: We didn't have to declare these ranges at index time. We didn't have to declare anything at index time other than the intention to eventually use range facets in our queries (by passing the {""facet"":true} argument into our index function.) That's it! 
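For a version you can run without worrying about shell escaping, here is a rough Python sketch of the same range-facet query using the requests library; the account name and credentials are placeholders, while the database, design doc and index names come from earlier in the walkthrough, so treat it as an illustration rather than a definitive recipe.

# Hypothetical sketch of the range-facet query with Python requests;
# ACCOUNT and the password are placeholders - substitute your own.
import json
import requests

ACCOUNT = 'yourusername'
AUTH = (ACCOUNT, 'yourpassword')

url = 'https://%s.cloudant.com/lobbyists/_design/SearchTest/_search/amountSearch' % ACCOUNT
params = {
    'q': '*:*',  # match every document, as in the example above
    'ranges': json.dumps({
        'amount': {
            'cheap': '[0 TO 25000]',            # square brackets: inclusive
            'expensive': '{25000 TO Infinity}'  # curly brackets: exclusive
        }
    }),
    'stale': 'ok',  # prefer a fast response over a fully up-to-date index
}

resp = requests.get(url, params=params, auth=AUTH)
result = resp.json()
# The facet totals should come back alongside the usual rows/bookmark fields
print(result.get('total_rows'), result.get('ranges'))

Letting requests handle the URL encoding also sidesteps the globbing and %20 escaping issues mentioned in the note above.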
That's all we have to do to enable this powerful query-time faceting functionality.Count facets allow us to quickly count by category (think author counts for a bookstore). Let's take a quick look!What really interested me in this database was the ""paper trail"" of the government agencies that lobbyists have been visiting. Unfortunately, these agencies are expressed in a nested structure within each document, so the index() function will be a little more complicated. Certified Cloudant Sherpa Max Thayer (@garbados) helped me out with the JavaScript in this example, so shout-out to Max (thank you!):This function will loop over all entities in a ""GovernmentEntities"" object and index them. We can form a search query to find all the government entities visited and how many times they were visited:This search query will yield a long list of agencies and the corresponding number of visits. Here's what a portion of this list looks like:{ ""counts"": { ""entity"": { ... ""Federal Aviation Administration (FAA)"": 7172, ""Federal Bureau of Investigation (FBI)"": 1037, ""Federal Communications Commission (FCC)"": 12470, ""Federal Deposit Insurance Commission (FDIC)"": 2164, ""Federal Election Commission (FEC)"": 305, ""Federal Emergency Management Agency (FEMA)"": 3962, ""Federal Energy Regulatory Commission (FERC)"": 3993, ""Federal Highway Administration (FHA)"": 2647, ""Federal Housing Finance Board (FHFB)"": 924, ""Federal Labor Relations Authority (FLRA)"": 55, ""Federal Law Enforcement Training Center"": 9, ""Federal Management Service"": 9, ""Federal Maritime Commission"": 640, ""Federal Mediation & Conciliation Service"": 16, ""Federal Mine Safety Health Review Commission (FMSH"": 4, ""Federal Motor Carrier Safety Administration"": 459, ""Federal Railroad Administration"": 1546, ""Federal Reserve System"": 3603, ""Federal Retirement Thrift Investment Board"": 46, ""Federal Trade Commission (FTC)"": 6169, ""Federal Transit Administration (FTA)"": 2556, ""Financial Crimes Enforcement Network (FinCEN)"": 70, ""Financial Management Service (FMS)"": 35, ""Food & Drug Administration (FDA)"": 10300, ... } }, ... }And we see that the FCC is fairly popular compared to, say, the Federal Election Commission. The task of calculating how much money was spent on each entity is left as an exercise for the reader.With very few deviations (mostly trips into the URL encoding quagmire), this post tracks exactly how I first started learning about search and search facets with Cloudant. Hopefully it gives you the tools you need to make use of this feature.Speaking of tools, if you use curl regularly and you'd rather not enter your Cloudant username and password for every HTTP request, consider configuring acurl, a tool that many Cloudant engineers use. Check out the post ""Authorized curl, a.k.a acurl"" for instructions.For a more fully featured example app that archives and indexes email from an IMAP server and makes it searchable, see this GitHub repo from Cloudant Developer Advocates Benjamin Young (@bigbluehat) and Jason Smith (@_jhs).If you have questions, we at Cloudant are always happy to help. Please ping us on IRC, email support@cloudant.com, or (better yet) use our awesome new support portal in the Cloudant dashboard.","Cloudant Search is based on Apache Lucene which allows facets of your data to be aggregated and counted during the search process. 
Facets allow your customers to drill-down into the search results, filtering in an powerful and intuitive way.",Search Faceting from Scratch [Tutorial],Live,128 329,"DATALAYER: GRAPHQL - TRANSLATING BACKEND DATA TO FRONTEND NEEDS Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published Dec 28, 2016Engineers working on backend data services are often focused on operational concerns like data consistency, reliability, uptime, and storage efficiency. Because each situation calls for a specific set of tradeoffs, a single organization can end up with a diverse set of backend databases and services. For the people building the UI and frontend API layers, this diversity can quickly become an issue, especially if the same client needs to call into multiple backends or fetch related objects across different data sources. GraphQL is a language-agnostic API gateway technology designed precisely to solve this mismatch between backend and frontend requirements. It provides a highly structured, yet flexible API layer that lets the client specify all of its data requirements in one GraphQL query, without needing to know about the backend services being accessed. Better yet, because of the structured, strongly typed nature of both GraphQL queries and APIs, it's possible to quickly get critical information, such as which objects and fields are accessed by which frontends, which clients will be affected by specific changes to the backend, and more. In this talk, Sasko Stubailo of Meteor explains what GraphQL is, what data management problems it can solve in an organization, and how you can try it today. Sashko Stubailo is passionate about building technologies that help developers build great apps. Sashko graduated with a CS degree from MIT in 2014 and has worked on a declarative reactive charting library at Palantir, an interactive i18n middleware for Rails at Panjiva, front end technology and build tooling in the Meteor framework, and is now leading the new Apollo project to build a next-generation GraphQL data platform. -------------------------------------------------------------------------------- If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com . We're happy to hear from you. Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Thom Crowe is a marketing and community guy at Compose, who enjoys long walks on the beach, reading, spending time with his wife and daughter and tinkering. Love this article? Head over to Thom Crowe’s author page and keep reading. Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Customer Stories Compose Webinars Support System Status Support Documentation Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ ScyllaDB MySQL Enterprise Add-ons * Deployments AWS SoftLayer Google Cloud * Subscribe Join 75,000 other devs on our weekly database newsletter. © 2016 Compose","Sasko Stubailo of Meteor explains what GraphQL is, what data management problems it can solve in an organization, and how you can try it today.",DataLayer Conference: Translating Backend Data to Frontend Needs,Live,129 331,"Skip to contentData, what now? Making sense of all that mess. 
MAIN NAVIGATION Menu * Blog * About Me FEATURE IMPORTANCE AND WHY IT’S IMPORTANT Vinko Kodžoman April 20, 2017 April 20, 2017I have been doing Kaggle’s Quora Question Pairs competition for about a month now, and by reading the discussions on the forums, I’ve noticed a recurring topic that I’d like to address. People seem to be struggling with getting the performance of their models past a certain point. The usual approach is to use XGBoost, ensembles and stacking. While those can generally give good results, I’d like to talk about why it is still important to do feature importance analysis. DATA EXPLORATION As an example, I will be using the Quora Question Pairs dataset . The dataset has 404,290 pairs of questions, and 37% of them are semantically the same (“duplicates”). The goal is to find out which ones. Initial steps; loading the dataset and data exploration: # Load the dataset train = pd.read_csv('train.csv', dtype={'question1': str, 'question2': str}) print('Training dataset row number:', len(train)) # 404290 print('Duplicate question pairs ratio: %.2f' % train.is_duplicate.mean()) # 0.37 Examples of duplicate and non-duplicate question pairs are shown below. question1 question2 is_duplicate What is the step by step guide to invest in share market in india? What is the step by step guide to invest in share market? 0 How can I be a good geologist? What should I do to be a great geologist? 1 How can I increase the speed of my internet connection while using a VPN? How can Internet speed be increased by hacking through DNS? 0 How do I read and find my YouTube comments? How do I read and find my YouTube comments? 1This is the word cloud inspired by a Kaggle kernel for data exploration . The cloud shows which words are popular (most frequent). The word cloud is created from words used in both questions. As you can see, the prevalent words are ones you would expect to find in a question (e.g. “best way”, “lose weight”, “difference”, “make money”, etc.) We now have some idea about what our dataset looks like. FEATURE ENGINEERING I created 24 features, some of which are shown below. All code is written in python using the standard machine learning libraries (pandas, sklearn, numpy). You can get the full code from my github notebook . Examples of some features: * q1_word_num – number of words in question1 * q2_length – number of characters in question2 * word_share – ratio of shared words between the questions * same_first_word – 1 if both questions share the same first word, else 0 def word_share(row): q1_words = set(word_tokenize(row['question1'])) q2_words = set(word_tokenize(row['question2'])) return len(q1_words.intersection(q2_words)) / (len(q1_words.union(q2_words))) def same_first_word(row): q1_words = word_tokenize(row['question1']) q2_words = word_tokenize(row['question2']) return float(q1_words[0].lower() == q2_words[0].lower()) # A sample of the features train['word_share'] = train.apply(word_share, axis=1) train['q1_word_num'] = train.question1.apply(lambda x: len(word_tokenize(x))) train['q2_word_num'] = train.question2.apply(lambda x: len(word_tokenize(x))) train['word_num_difference'] = abs(train.q1_word_num - train.q2_word_num) train['q1_length'] = train.question1.apply(lambda x: len(x)) train['q2_length'] = train.question2.apply(lambda x: len(x)) train['length_difference'] = abs(train.q1_length - train.q2_length) train['q1_has_fullstop'] = train.question1.apply(lambda x: int('.' in x)) train['q2_has_fullstop'] = train.question2.apply(lambda x: int('.' 
in x)) train['q1_has_math_expression'] = train.question1.apply(lambda x: int('[math]' in x)) train['q2_has_math_expression'] = train.question2.apply(lambda x: int('[math]' in x)) train['same_first_word'] = train.apply(same_first_word, axis=1) BASELINE MODEL PERFORMANCE To get the model performance, we first split the dataset into train and test sets. The test set contains 20% of the total data. To evaluate the model's performance, we use the created test set (X_test and y_test). X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) The model is evaluated with the logloss function. It is the same metric used in the competition. $logloss = -\frac{1}{N} \displaystyle\sum_{i=1}^{N} \displaystyle\sum_{j=1}^{M} y_{i,j} \log(p_{i,j})$ To test the model with all the features, we use the Random Forest classifier. It is a powerful “out of the box” ensemble classifier. No hyperparameter tuning was done – the hyperparameters can remain fixed because we are testing the model's performance against different feature sets. A simple model gives a logloss score of 0.62923, which would put us at the 1371st place out of a total of 1692 teams at the time of writing this post. Now let's see if doing feature selection could help us lower the logloss. model = RandomForestClassifier(50, n_jobs=8) model.fit(X_train, y_train) predictions_proba = model.predict_proba(X_test) predictions = model.predict(X_test) log_loss_score = log_loss(y_test, predictions_proba) acc = accuracy_score(y_test, predictions) f1 = f1_score(y_test, predictions) print('Log loss: %.5f' % log_loss_score) # 0.62923 print('Acc: %.5f' % acc) # 0.70952 print('F1: %.5f' % f1) # 0.59173 FEATURE IMPORTANCE To get the feature importance scores, we will use an algorithm that does feature selection by default – XGBoost. It is the king of Kaggle competitions. If you are not using a neural net, you probably have one of these somewhere in your pipeline. XGBoost uses gradient boosting to optimize creation of decision trees in the ensemble. Each tree contains nodes, and each node is a single feature. The number of times a feature is used in the nodes of XGBoost's decision trees gives a measure of its effect on the overall performance of the model. model = XGBClassifier(n_estimators=500) model.fit(X, y) feature_importance = model.feature_importances_ plt.figure(figsize=(16, 6)) plt.yscale('log', nonposy='clip') plt.bar(range(len(feature_importance)), feature_importance, align='center') plt.xticks(range(len(feature_importance)), features, rotation='vertical') plt.title('Feature importance') plt.ylabel('Importance') plt.xlabel('Features') plt.show() Looking at the graph below, we see that some features are not used at all, while some (word_share) impact the performance greatly. We can reduce the number of features by taking a subset of the most important features. Using the feature importance scores, we reduce the feature set. The new pruned features contain all features that have an importance score greater than a certain number. In our case, the pruned features contain a minimum importance score of 0.05. 
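One gap worth flagging before the next snippet: extract_pruned_features filters a feature_importances object by a 'weights' column, but the code so far only produces the raw array model.feature_importances_. Presumably the full notebook linked above wraps that array in a DataFrame indexed by feature name; a minimal sketch of that step, reusing the model and features names from the code above, might look like this.

# Minimal sketch (not from the original post): wrap the raw importance array
# in a DataFrame so it can be filtered by a 'weights' threshold below.
import pandas as pd

feature_importances = pd.DataFrame(
    model.feature_importances_,  # importance scores from the fitted XGBClassifier
    index=features,              # the list of engineered feature names used above
    columns=['weights']
).sort_values('weights', ascending=False)

print(feature_importances.head())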
def extract_pruned_features(feature_importances, min_score=0.05): column_slice = feature_importances[feature_importances['weights'] > min_score] return column_slice.index.values pruned_featurse = extract_pruned_features(feature_importances, min_score=0.01) X_train_reduced = X_train[pruned_featurse] X_test_reduced = X_test[pruned_featurse] def fit_and_print_metrics(X_train, y_train, X_test, y_test, model): model.fit(X_train, y_train) predictions_proba = model.predict_proba(X_test) log_loss_score = log_loss(y_test, predictions_proba) print('Log loss: %.5f' % log_loss_score) MODEL PERFORMANCE WITH FEATURE IMPORTANCE ANALYSIS As a result of using the pruned features, our previous model – Random Forest – scores better. With little effort, the algorithm gets a lower loss, and it also trains more quickly and uses less memory because the feature set is reduced. model = RandomForestClassifier(50, n_jobs=8) # LogLoss 0.59251 fit_and_print_metrics(X_train_reduced, y_train, X_test_reduced, y_test, model) # LogLoss 0.63376 fit_and_print_metrics(X_train, y_train, X_test, y_test, model) Playing a bit more with feature importance score (plotting the logloss of our classifier for a certain subset of pruned features) we can lower the loss even more. In this particular case, Random Forest actually works best with only one feature! Using only the feature “word_share” gives a logloss of 0.55305. If you are interested to see this step in detail, the full version is in the notebook . CONCLUSION As I have shown, utilising feature importance analysis has a potential to increase the model’s performance. While some models like XGBoost do feature selection for us, it is still important to be able to know the impact of a certain feature on the model’s performance because it gives you more control over the task you are trying to accomplish. The “no free lunch” theorem (there is no solution which is best for all problems) tells us that even though XGBoost usually outperforms other models, it is up to us to discern whether it is really the best solution. Using XGBoost to get a subset of important features allows us to increase the performance of models without feature selection by giving that feature subset to them. Using feature selection based on feature importance can greatly increase the performance of your models. Categories Data Science , Deep Learning , Machine Learning Tags feature engineering , feature importance , features , machine learning , pythonLEAVE A REPLY CANCEL REPLY Your email address will not be published. Required fields are marked * Comment Name * Email * Website Notify me of follow-up comments by email. Notify me of new posts by email. 
PRIMARY SIDEBAR Toggle Sidebar Search for:NEWSLETTER RECENT POSTS * Feature importance and why it’s important ARCHIVES * April 2017 Weenkus (Vinko Kodžoman) Vinko Kodžoman Weenkus Zagreb, Croatia vinko.kodzoman@yahoo.com Joined on Oct 07, 2014 8 Followers 19 Following 26 Public Repositories ansiweather Blog book_problems cards cats_vs_dogs_redux_kaggle Competition Deep-Learning-University-of-Zagreb digit_recognizer_kaggle FM-index GTEngine hello_app identicon_generator InverseMatrixCaching LabDump leaf_classification_kaggle LearnOpenGL_tutorial Machine-Learning-University-of-Washington Machine-Learning-University-of-Zagreb My-personal-webpage One-Hump-Iterator-Visualization on_power_efficient_virtual_network_function_placement_algorithm Reference-Genome-Index Rentals Search-Engine Sexual-Predator-Classification-Using-Ensemble-Classifiers toy_app 0 Public GistsData, what now? © 2017 . All Rights Reserved",Feature importance in machine learning using examples in Python with xgboost. Getting better performance from a model with feature pruning.,Feature importance and why it's important,Live,130 332,"Toggle navigation * * About * * Archives * * PRACTICAL BUSINESS PYTHON Taking care of business, one python script at a time Sun 26 October 2014SIMPLE GRAPHING WITH IPYTHON AND PANDAS Posted by Chris Moffitt in articles INTRODUCTION This article is a follow on to my previous article on analyzing data with python. I am going to build on my basic intro of IPython , notebooks and pandas to show how to visualize the data you have processed with these tools. I hope that this will demonstrate to you (once again) how powerful these tools are and how much you can get done with such little code. I ultimately hope these articles will help people stop reaching for Excel every time they need to slice and dice some files. The tools in the python environment can be so much more powerful than the manual copying and pasting most people do in excel. I will walk through how to start doing some simple graphing and plotting of data in pandas. I am using a new data file that is the same format as my previous article but includes data for only 20 customers. If you would like to follow along, the file is available here . GETTING STARTED As described in the previous article , I’m using an IPython notebook to explore my data. First we are going to import pandas, numpy and matplot lib. I am also showing the pandas version I’m using so you can make sure yours is compatible. importpandasaspdimportnumpyasnpimportmatplotlib.pyplotaspltpd.__version__ '0.14.1' Next, enable IPython to display matplotlib graphs. %matplotlibinline We will read in the file like we did in the previous article but I’m going to tell it to treat the date column as a date field (using parse_dates ) so I can do some re-sampling later. 
sales=pd.read_csv(""sample-salesv2.csv"",parse_dates=['date'])sales.head() account number name sku category quantity unit price ext price date 0 296809 Carroll PLC QN -82852 Belt 13 44.48 578.24 2014-09-27 07:13:03 1 98022 Heidenreich-Bosco MJ -21460 Shoes 19 53.62 1018.78 2014-07-29 02:10:44 2 563905 Kerluke, Reilly and Bechtelar AS -93055 Shirt 12 24.16 289.92 2014-03-01 10:51:24 3 93356 Waters-Walker AS -93055 Shirt 5 82.68 413.40 2013-11-17 20:41:11 4 659366 Waelchi-Fahey AS -93055 Shirt 18 99.64 1793.52 2014-01-03 08:14:27Now that we have read in the data, we can do some quick analysis sales.describe() account number quantity unit price ext price count 1000.000000 1000.000000 1000.000000 1000.00000 mean 535208.897000 10.328000 56.179630 579.84390 std 277589.746014 5.687597 25.331939 435.30381 min 93356.000000 1.000000 10.060000 10.38000 25% 299771.000000 5.750000 35.995000 232.60500 50% 563905.000000 10.000000 56.765000 471.72000 75% 750461.000000 15.000000 76.802500 878.13750 max 995267.000000 20.000000 99.970000 1994.80000We can actually learn some pretty helpful info from this simple command: * We can tell that customers on average purchases 10.3 items per transaction * The average cost of the transaction was $579.84 * It is also easy to see the min and max so you understand the range of the data If we want we can look at a single column as well: sales['unit price'].describe() count 1000.000000 mean 56.179630 std 25.331939 min 10.060000 25% 35.995000 50% 56.765000 75% 76.802500 max 99.970000 dtype: float64 I can see that my average price is $56.18 but it ranges from $10.06 to $99.97. I am showing the output of dtypes so that you can see that the date column is a datetime field. I also scan this to make sure that any columns that have numbers are floats or ints so that I can do additional analysis in the future. sales.dtypes account number int64 name object sku object category object quantity int64 unit price float64 ext price float64 date datetime64[ns] dtype: object PLOTTING SOME DATA We have our data read in and have completed some basic analysis. Let’s start plotting it. First remove some columns to make additional analysis easier. customers=sales[['name','ext price','date']]customers.head() name ext price date 0 Carroll PLC 578.24 2014-09-27 07:13:03 1 Heidenreich-Bosco 1018.78 2014-07-29 02:10:44 2 Kerluke, Reilly and Bechtelar 289.92 2014-03-01 10:51:24 3 Waters-Walker 413.40 2013-11-17 20:41:11 4 Waelchi-Fahey 1793.52 2014-01-03 08:14:27This representation has multiple lines for each customer. In order to understand purchasing patterns, let’s group all the customers by name. We can also look at the number of entries per customer to get an idea for the distribution. customer_group=customers.groupby('name')customer_group.size() name Berge LLC 52 Carroll PLC 57 Cole-Eichmann 51 Davis, Kshlerin and Reilly 41 Ernser, Cruickshank and Lind 47 Gorczany-Hahn 42 Hamill-Hackett 44 Hegmann and Sons 58 Heidenreich-Bosco 40 Huel-Haag 43 Kerluke, Reilly and Bechtelar 52 Kihn, McClure and Denesik 58 Kilback-Gerlach 45 Koelpin PLC 53 Kunze Inc 54 Kuphal, Zieme and Kub 52 Senger, Upton and Breitenberg 59 Volkman, Goyette and Lemke 48 Waelchi-Fahey 54 Waters-Walker 50 dtype: int64 Now that our data is in a simple format to manipulate, let’s determine how much each customer purchased during our time frame. The sum function allows us to quickly sum up all the values by customer. We can also sort the data using the sort command. 
sales_totals=customer_group.sum()sales_totals.sort(columns='ext price').head() ext price name Davis, Kshlerin and Reilly 19054.76 Huel-Haag 21087.88 Gorczany-Hahn 22207.90 Hamill-Hackett 23433.78 Heidenreich-Bosco 25428.29Now that we know what the data look like, it is very simple to create a quick bar chart plot. Using the IPython notebook, the graph will automatically display. my_plot=sales_totals.plot(kind='bar') Unfortunately this chart is a little ugly. With a few tweaks we can make it a little more impactful. Let’s try: * sorting the data in descending order * removing the legend * adding a title * labeling the axes my_plot=sales_totals.sort(columns='ext price',ascending=False).plot(kind='bar',legend=None,title=""Total Sales by Customer"")my_plot.set_xlabel(""Customers"")my_plot.set_ylabel(""Sales ($)"") This actually tells us a little about our biggest customers and how much difference there is between their sales and our smallest customers. Now, let’s try to see how the sales break down by category. customers=sales[['name','category','ext price','date']]customers.head() name category ext price date 0 Carroll PLC Belt 578.24 2014-09-27 07:13:03 1 Heidenreich-Bosco Shoes 1018.78 2014-07-29 02:10:44 2 Kerluke, Reilly and Bechtelar Shirt 289.92 2014-03-01 10:51:24 3 Waters-Walker Shirt 413.40 2013-11-17 20:41:11 4 Waelchi-Fahey Shirt 1793.52 2014-01-03 08:14:27We can use groupby to organize the data by category and name. category_group=customers.groupby(['name','category']).sum()category_group.head() ext price name category Berge LLC Belt 6033.53 Shirt 9670.24 Shoes 14361.10 Carroll PLC Belt 9359.26 Shirt 13717.61The category representation looks good but we need to break it apart to graph it as a stacked bar graph. unstack can do this for us. category_group.unstack().head() ext price category Belt Shirt Shoes name Berge LLC 6033.53 9670.24 14361.10 Carroll PLC 9359.26 13717.61 12857.44 Cole-Eichmann 8112.70 14528.01 7794.71 Davis, Kshlerin and Reilly 1604.13 7533.03 9917.60 Ernser, Cruickshank and Lind 5894.38 16944.19 5250.45Now plot it. my_plot=category_group.unstack().plot(kind='bar',stacked=True,title=""Total Sales by Customer"")my_plot.set_xlabel(""Customers"")my_plot.set_ylabel(""Sales"") In order to clean this up a little bit, we can specify the figure size and customize the legend. my_plot=category_group.unstack().plot(kind='bar',stacked=True,title=""Total Sales by Customer"",figsize=(9,7))my_plot.set_xlabel(""Customers"")my_plot.set_ylabel(""Sales"")my_plot.legend([""Total"",""Belts"",""Shirts"",""Shoes""],loc=9,ncol=4) Now that we know who the biggest customers are and how they purchase products, we might want to look at purchase patterns in more detail. Let’s take another look at the data and try to see how large the individual purchases are. A histogram allows us to group purchases together so we can see how big the customer transactions are. purchase_patterns=sales[['ext price','date']]purchase_patterns.head() ext price date 0 578.24 2014-09-27 07:13:03 1 1018.78 2014-07-29 02:10:44 2 289.92 2014-03-01 10:51:24 3 413.40 2013-11-17 20:41:11 4 1793.52 2014-01-03 08:14:27We can create a histogram with 20 bins to show the distribution of purchasing patterns. 
purchase_plot=purchase_patterns['ext price'].hist(bins=20)purchase_plot.set_title(""Purchase Patterns"")purchase_plot.set_xlabel(""Order Amount($)"")purchase_plot.set_ylabel(""Number of orders"") In looking at purchase patterns over time, we can see that most of our transactions are less than $500 and only a very few are about $1500. Another interesting way to look at the data would be by sales over time. A chart might help us understand, “Do we have certain months where we are busier than others?” Let’s get the data down to order size and date. purchase_patterns=sales[['ext price','date']]purchase_patterns.head() ext price date 0 578.24 2014-09-27 07:13:03 1 1018.78 2014-07-29 02:10:44 2 289.92 2014-03-01 10:51:24 3 413.40 2013-11-17 20:41:11 4 1793.52 2014-01-03 08:14:27If we want to analyze the data by date, we need to set the date column as the index using set_index . purchase_patterns=purchase_patterns.set_index('date')purchase_patterns.head() ext price date 2014-09-27 07:13:03 578.24 2014-07-29 02:10:44 1018.78 2014-03-01 10:51:24 289.92 2013-11-17 20:41:11 413.40 2014-01-03 08:14:27 1793.52One of the really cool things that pandas allows us to do is resample the data. If we want to look at the data by month, we can easily resample and sum it all up. You’ll notice I’m using ‘M’ as the period for resampling which means the data should be resampled on a month boundary. purchase_patterns.resample('M',how=sum) Plotting the data is now very easy purchase_plot=purchase_patterns.resample('M',how=sum).plot(title=""Total Sales by Month"",legend=None) Looking at the chart, we can easily see that December is our peak month and April is the slowest. Let’s say we really like this plot and want to save it somewhere for a presentation. fig=purchase_plot.get_figure()fig.savefig(""total-sales.png"") PULLING IT ALL TOGETHER In my typical workflow, I would follow the process above of using an IPython notebook to play with the data and determine how best to make this process repeatable. If I intend to run this analysis on a periodic basis, I will create a standalone script that will do all this with one command. 
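One more hedged version note: resample('M', how=sum) is the older resample spelling; newer pandas drops the how argument and chains the aggregation instead, roughly:

# rough modern equivalent of purchase_patterns.resample('M', how=sum)
monthly_sales = purchase_patterns.resample('M').sum()
purchase_plot = monthly_sales.plot(title='Total Sales by Month', legend=None)

The consolidated script that follows keeps the article's original calls so that it matches the output shown above.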
Here is an example of pulling all this together into a single file:

# Standard import for pandas, numpy and matplot
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Read in the csv file and display some of the basic info
sales = pd.read_csv("sample-salesv2.csv", parse_dates=['date'])
print "Data types in the file:"
print sales.dtypes
print "Summary of the input file:"
print sales.describe()
print "Basic unit price stats:"
print sales['unit price'].describe()

# Filter the columns down to the ones we need to look at for customer sales
customers = sales[['name', 'ext price', 'date']]

# Group the customers by name and sum their sales
customer_group = customers.groupby('name')
sales_totals = customer_group.sum()

# Create a basic bar chart for the sales data and show it
bar_plot = sales_totals.sort(columns='ext price', ascending=False).plot(kind='bar', legend=None, title="Total Sales by Customer")
bar_plot.set_xlabel("Customers")
bar_plot.set_ylabel("Sales ($)")
plt.show()

# Do a similar chart but break down by category in stacked bars
# Select the appropriate columns and group by name and category
customers = sales[['name', 'category', 'ext price', 'date']]
category_group = customers.groupby(['name', 'category']).sum()

# Plot and show the stacked bar chart
stack_bar_plot = category_group.unstack().plot(kind='bar', stacked=True, title="Total Sales by Customer", figsize=(9, 7))
stack_bar_plot.set_xlabel("Customers")
stack_bar_plot.set_ylabel("Sales")
stack_bar_plot.legend(["Total", "Belts", "Shirts", "Shoes"], loc=9, ncol=4)
plt.show()

# Create a simple histogram of purchase volumes
purchase_patterns = sales[['ext price', 'date']]
purchase_plot = purchase_patterns['ext price'].hist(bins=20)
purchase_plot.set_title("Purchase Patterns")
purchase_plot.set_xlabel("Order Amount($)")
purchase_plot.set_ylabel("Number of orders")
plt.show()

# Create a line chart showing purchases by month
purchase_patterns = purchase_patterns.set_index('date')
month_plot = purchase_patterns.resample('M', how=sum).plot(title="Total Sales by Month", legend=None)
fig = month_plot.get_figure()

# Show the image, then save it
plt.show()
fig.savefig("total-sales.png")

The impressive thing about this code is that in 55 lines (including comments), I've created a very powerful yet simple to understand program to repeatedly manipulate the data and create useful output. I hope this is useful. Feel free to provide feedback in the comments and let me know if this is helpful.
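If you run the consolidated script somewhere without a display (a cron job or a remote server, say), the plt.show() calls will block or fail. A common workaround, offered here as a sketch rather than as part of the original article, is to select a non-interactive backend and save each figure to a file instead:

import matplotlib
matplotlib.use('Agg')            # must be set before pyplot is imported
import matplotlib.pyplot as plt

# ...build the plots exactly as in the script above, then instead of plt.show():
# fig = bar_plot.get_figure()
# fig.savefig('total-sales-by-customer.png')   # hypothetical output name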
* ← Simple Interactive Data Analysis with Python * Using Pandas To Create an Excel Diff → Tags pandas csv excel ipython -------------------------------------------------------------------------------- Tweet Vote on Hacker NewsCOMMENTS SOCIAL * Github * Twitter * BitBucket * Reddit * LinkedIn CATEGORIES * articles * news POPULAR * Pandas Pivot Table Explained * Common Excel Tasks Demonstrated in Pandas * Overview of Python Visualization Tools * Web Scraping - It's Your Civic Duty * Simple Graphing with IPython and Pandas TAGS sets pygal csv barnum process s3 matplotlib plotting stdlib oauth2 xlsxwriter pelican jinja python google matplot pandas ipython seaborn notebooks cases xlwings gui excel vcs ggplot beautifulsoup powerpoint bokeh plotly analyze-this pdf github FEEDS * Atom Feed -------------------------------------------------------------------------------- Site built using Pelican • Theme based on VoidyBootstrap by RKI","This article is a follow on to the previous article on analyzing data with python, building on the basic intro of IPython, notebooks and pandas to show how to visualize the data you have processed with these tools.",Simple Graphing with IPython and Pandas,Live,131 334,"Homepage Follow Sign in / Sign up Homepage * Home * Data Science * Machine Learning * Programming * Visualization * Events * Letters * * Contribute * Karlijn Willems Blocked Unblock Follow Following Data Science Journalist @DataCamp Oct 12 -------------------------------------------------------------------------------- COLLECTING DATA SCIENCE CHEAT SHEETS As you might already know, I’ve been making Python and R cheat sheets specifically for those who are just starting out with data science or for those who need an extra help when working on data science problems. Now you can find all of them in one place on the DataCamp Community. You can find all cheat sheets here . To recap, these are the data science cheat sheets that we have already made and shared with the community up until now: Basics * Python Basics Cheat Sheet * Scipy Linear Algebra Cheat Sheet Data Manipulation * NumPy Basics Cheat Sheet * Pandas Basics Cheat Sheet * Pandas Data Wrangling Cheat Sheet * xts Cheat sheet * data.table Cheat Sheet ( updated! ) Machine Learning, Deep Learning, Big Data * Scikit-Learn Cheat Sheet * Keras Cheat Sheet * PySpark RDD Cheat Sheet * PySpark SparkSQL Cheat Sheet Data Visualization * Matplotlib Cheat Sheet * Seaborn Cheat Sheet * Bokeh Cheat Sheet ( updated! ) IDE * Jupyter Notebook Cheat Sheet Enjoy and feel free to share! PS. Did you see another data science cheat sheet that you’d like to recommend? Let us know here ! * Data Science * Data Analysis * Big Data * Data Visualization * Machine Learning Show your supportClapping shows how much you appreciated Karlijn Willems’s story. 833 1 Blocked Unblock Follow FollowingKARLIJN WILLEMS Data Science Journalist @DataCamp FollowTOWARDS DATA SCIENCE Sharing concepts, ideas, and codes. * 833 * * * Never miss a story from Towards Data Science , when you sign up for Medium. 
Learn more Never miss a story from Towards Data Science Get updates Get updates",Python and R cheat sheets specifically for those who are just starting out with data science or for those who need an extra help when working on data science problems.,Collecting Data Science Cheat Sheets,Live,132 339,"Compose The Compose logo Articles Sign in Free 30-day trialHOW TO SCRIPT PAINLESS-LY IN ELASTICSEARCH Published Aug 9, 2017 How to Script Painless-ly in Elasticsearch elasticsearch painless scripting Free 30 Day TrialWith the release of Elasticsearch 5.x came Painless, Elasticsearch's answer to safe, secure, and performant scripting. We'll introduce you to Painless and show you what it can do. With the introduction of Elasticsearch 5.x over a year ago, we got a new scripting language, Painless. Painless is a scripting language developed and maintained by Elastic and optimized for Elasticsearch. While it's still an experimental scripting language, at its core Painless is promoted as a fast, safe, easy to use, and secure. In this article, we'll give you a short introduction to Painless, and show you how to use the language when searching and updating your data. On to Painless ... A PAINLESS INTRODUCTION The objective of Painless scripting is to make writing scripts painless for the user, especially if you're coming from a Java or Groovy environment. While you might not be familiar with scripting in Elasticsearch in general, let's start with the basics. Variables and Data TypesVariables can be declared in Painless using primitive, reference, string, void (doesn't return a value), array, and dynamic typings. Painless supports the following primitive types: byte , short , char , int , long , float , double , and boolean . These are declared in a way similar to Java, for example, int i = 0; double a; boolean g = true; . Reference types in Painless are also similar to Java, except they don't support access modifiers, but support Java-like inheritance. These types can be allocated using the new keyword on initialization such as when declaring a as an ArrayList, or simply declaring a single variable b to a null Map like: ArrayList a = new ArrayList(); Map b; Map g = [:]; List q = [1, 2, 3]; Lists and Maps are similar to arrays, except they don't require the new keyword on initialization, but they are reference types, not arrays. String types can be used along with any variable with or without allocating it with the new keyword. For example: String a = ""a""; String foo = new String(""bar""); Array types in Painless support single and multidimensional arrays with null as the default value. Like reference types, arrays are allocated using the new keyword then the type and a set of brackets for each dimension. An array can be declared and initialized like the following: int[] x = new int[2]; x[0] = 3; x[1] = 4; The size of the array can be explicit, for example, int[] a = new int[2] or you can create an array with values 1 to 5 and a size of 5 using: int[] b = new int[] {1,2,3,4,5}; Like arrays in Java and Groovy, the array data type must have a primitive, string, or even a dynamic def associated with it on declaration and initialization. def is the only dynamic type supported by Painless and has the best of all worlds when declaring variables. What it does is it mimics the behavior of whatever type it's assigned at runtime. 
So, when defining a variable: def a = 1; def b = ""foo""; In the above code, Elasticsearch will always assume a is a primitive type int with a value of 1 and b as a string type with the value of ""foo"" . Arrays can also be assigned with a def , for instance, note the following: def[][] h = new def[2][2]; def[] f = new def[] {4, ""s"", 5.7, 2.8C}; With variables out of the way, let's take a look at conditionals and operators. Operators and ConditionalsIf you know Java, Groovy, or a modern programming language, then conditionals and using operators in Painless will be familiar. The Painless documentation contains an entire list of operators that are compatible with the language in addition to their order of precedence and associativity. Most of the operators on the list are compatible with Java and Groovy languages. Like most programming languages operator precedence can be overridden with parentheses (e.g. int t = 5+(5*5) ). Working with conditionals in Painless is the same using them in most programming languages. Painless supports if and else , but not else if or switch . A conditional statement will look familiar to most programmers: if (doc['foo'].value = 5) { doc['foo'].value *= 10; } else { doc['foo'].value += 10; } Painless also has the Elvis operator ?: , which is behaves more like the operator in Kotlin than Groovy. Basically, if we have the following: x ?: y the Elvis operator will evaluate the right-side expression and returns whatever the value of x is if not null . If x is null then the left-side expression is evaluated. Using primitives won't work with the Elvis operator, so def is preferred here when it's used. MethodsWhile the Java language is where Painless gets most of its power from, not every class or method from the Java standard library (Java Runtime Environment, JRE) is available. Elasticsearch has a whitelist reference of classes and methods that are available to Painless. The list doesn't only include those available from the JRE, but also Elasticsearch and Painless methods that are available to use. Painless LoopsPainless supports while , do...while , for loops, and control flow statements like break and continue which are all available in Java. An example for loop in Painless will also look familiar in most modern programming languages. In the following example, we loop over an array containing scores from our document doc['scores'] and add them to the variable total then return it: def total = 0; for (def i = 0; i Modifying that loop to the following will also work: def total = 0; for (def score : doc['scores']) { total += score; } return total; Now that we have an overview of some of the language fundamentals, let's start looking at some data and see how we can use Painless with Elasticsearch queries. LOADING THE DATA Before loading data into Elasticsearch, make sure you have a fresh index set up. You'll need to create a new index either in the Compose console, in the terminal, or use the programming language of your choice. The index that we'll create is called ""sat"". Once you've set up the index, let's gather the data. The data we're going to use is a list of average SAT scores by school for the year 2015/16 compiled by the California Department of Education. The data from the California Department of Education comes in a Microsoft Excel file. We converted the data into JSON which can be downloaded from the Github repository here . After downloading the JSON file, using Elasticsearch's Bulk API we can insert the data into the ""sat"" index we created. 
curl -XPOST -u username:password 'https://portal333-5.compose-elasticsearch.compose-44.composedb.com:44444/_bulk' --data-binary @sat_scores.json Remember to substitute the username, password, and deployment URL with your own and add _bulk to the end of the URL to start importing data. SEARCHING ELASTICSEARCH USING PAINLESS Now that we have the SAT scores loaded into the ""sat"" index, we can start using Painless in our SAT queries. In the following examples, all variables will use def to demonstrate Painless's dynamic typing support. The format of scripts in Elasticsearch looks similar to the following: GET sat/_search { ""script_fields"": { ""some_scores"": { ""script"": { ""lang"": ""painless"", ""inline"": ""def scores = 0; scores = doc['AvgScrRead'].value + doc['AvgScrWrit'].value; return scores;"" } } } } Within a script you can define the scripting language lang , where Painless is the default. In addition, we can specify the source of the script. For example, we're using inline scripts or those that are run when making a query. We also have the option of using stored , which are scripts that are stored in the cluster. Also, we have file scripts that are scripts stored in a file and referenced within Elasticsearch's configuration directory. Let's look at the above script in a little more detail. In the above script, we're using the _search API and the script_fields command. This command will allow us to create a new field that will hold the scores that we write in the script . Here, we've called it some_scores just as an example. Within this new script field, use the script field to define the scripting language painless (Painless is already the default language) and use the field inline which will include our Painless script: def scores = 0; scores = doc['AvgScrRead'].value + doc['AvgScrWrit'].value; return scores; You'll notice immediately that the Painless script that we just wrote doesn't have any line breaks. That's because scripts in Elasticseach must be written out as a single-line string. Running this simple query doesn't require Painless scripting. In fact, it could be done with Lucene Expressions, but it serves just as an example. Let's look at the results: { ""_index"": ""sat"", ""_type"": ""scores"", ""_id"": ""AV3CYR8JFgEfgdUCQSON"", ""_score"": 1, ""_source"": { ""cds"": 1611760130062, ""rtype"": ""S"", ""sname"": ""American High"", ""dname"": ""Fremont Unified"", ""cname"": ""Alameda"", ""enroll12"": 444, ""NumTstTakr"": 298, ""AvgScrRead"": 576, ""AvgScrMath"": 610, ""AvgScrWrit"": 576, ""NumGE1500"": 229, ""PctGE1500"": 76.85, ""year"": 1516 }, ""fields"": { ""some_scores"": [ 1152 ] } } The script is run on each document in the index. The above result shows that a new field called fields has been created with another field containing the name of the new field some_scores that we created with the script_fields command. Let's write another query that will search for schools that have a SAT reading score of less than 350 and a math score of more than 350. The script for that would look like: doc['AvgScrRead'].value < 350 && doc['AvgScrMath'].value > 350 And the query: GET sat/_search { ""query"": { ""script"": { ""script"": { ""inline"": ""doc['AvgScrRead'].value < 350 && doc['AvgScrMath'].value > 350"", ""lang"": ""painless"" } } } } This will give us four schools. 
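If you are not working in a query console, the same filtering query can be sent with curl, reusing the placeholder deployment URL and credentials from the bulk-load step above. One way to do it, sketched here with an assumed file name of script_query.json for the request body:

{
  "query": {
    "script": {
      "script": {
        "inline": "doc['AvgScrRead'].value < 350 && doc['AvgScrMath'].value > 350",
        "lang": "painless"
      }
    }
  }
}

and then POST it to the _search endpoint:

curl -XPOST -u username:password 'https://portal333-5.compose-elasticsearch.compose-44.composedb.com:44444/sat/_search?pretty' -H 'Content-Type: application/json' --data-binary @script_query.json

which should return the same four hits.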
Of those four schools, we can then use Painless to create an array containing four values: the SAT scores from our data and a total SAT score, or the sum of all the SAT scores: def sat_scores = []; def score_names = ['AvgScrRead', 'AvgScrWrit', 'AvgScrMath']; for (int i = 0; i We'll create a sat_scores array to hold the SAT scores ( AvgScrRead , AvgScrWrit , and AvgScrMath ) and the total score that we'll calculate. We'll create another array called scores_names to hold the names of the document fields that contain SAT scores. If in the future our field names change, all we'd have to do is update the names in the array. Using a for loop, we'll loop through the document fields using the score_names array, and put their corresponding values in the sat_scores array. Next, we'll loop over our sat_scores array and add the values of the three SAT scores together and place that score in a temporary variable temp . Then, we add the temp value to our sat_scores array giving us the three individual SAT scores plus their total score. The entire query to get the four schools and the script looks like: GET sat/_search { ""query"": { ""script"": { ""script"": { ""inline"": ""doc['AvgScrRead'].value < 350 && doc['AvgScrMath'].value i "", ""lang"": ""painless"" } } } } Each document returned by the query will look similar to: ""hits"": { ""total"": 4, ""max_score"": 1, ""hits"": [ { ""_index"": ""sat"", ""_type"": ""scores"", ""_id"": ""AV3CYR8PFgEfgdUCQSpM"", ""_score"": 1, ""fields"": { ""scores"": [ 326, 311, 368, 1005 ] } } ... One drawback of using the _search API is that the results aren't stored. To do that, we'd have to use the _update or _update_by_query API to update individual documents or all the documents in the index. So, let's update our index with the query results we've just used. UPDATING ELASTICSEARCH USING PAINLESS Before we move further, let's create another field in our data that will hold an array of the SAT scores. To do that, we'll use Elasticsearch's _update_by_query API to add a new field called All_Scores which will initially start out as an empty array: POST sat/_update_by_query { ""script"": { ""inline"": ""ctx._source.All_Scores = []"", ""lang"": ""painless"" } } This will update the index to include the new field where we can start adding our scores to. To do that, we'll use a script to update the All_Scores field: def scores = ['AvgScrRead', 'AvgScrWrit', 'AvgScrMath']; for (int i = 0; i Using _update or the _update_by_query API, we won't have access to the doc value. Instead, Elasticsearch exposes the ctx variable and the _source document that allows us to access the each document's fields. From there we can update the All_Scores array for each document with each SAT score and the total average SAT score for the school. The entire query looks like this: POST sat/_update_by_query { ""script"": { ""inline"": ""def scores = ['AvgScrRead', 'AvgScrWrit', 'AvgScrMath']; for (int i = 0; i "", ""lang"": ""painless"" } } If we want to update only a single document, we can do that, too, using a similar script. All we'll need to indicate is the document's _id in the POST URL. In the following update, we're simply adding 10 points to the AvgScrMath score for the document with id ""AV2mluV4aqbKx_m2Ul0m"". POST sat/scores/AV2mluV4aqbKx_m2Ul0m/_update { ""script"": { ""inline"": ""ctx._source.AvgScrMath += 10"", ""lang"": ""painless"" } } SUMMING UP We've gone over the basics of Elasticsearch's Painless scripting language and have given some examples of how it works. 
Also, using some of the Painless API methods like HashMap and loops, we've given you a taste of what you could do with the language when updating your documents, or just modifying your data prior to getting your search results back. Nonetheless, this is just the tip of the iceberg for what's possible with Painless. -------------------------------------------------------------------------------- If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com . We're happy to hear from you. attribution Leeroy Agency Abdullah Alger is a former University lecturer who likes to dig into code, show people how to use and abuse technology, talk about GIS, and fish when the conditions are right. Coffee is in his DNA. Love this article? Head over to Abdullah Alger ’s author page to keep reading.CONQUER THE DATA LAYER Spend your time developing apps, not managing databases. Try Compose for Free for 30 DaysRELATED ARTICLES Aug 4, 2017NEWSBITS - SUMMER READING WITH SCYLLA, ELASTICSEARCH, CASSANDRA AND POSTGRESQL These are the Compose NewsBits for the week ending August 4th... Using Scylla and Elasticsearch together. Cassandra, partitio… Dj Walker-Morgan Jul 28, 2017NEWSBITS - SCYLLA PREVIEWS MATERIALIZED VIEWS These are the database, cloud and developer News bits for the week ending July 28th: A preview of Scylla's materialized view… Dj Walker-Morgan Jul 12, 2017INTEGRATION TESTING AGAINST REAL DATABASES Integration testing can be challenging, and adding a database to the mix makes it even more so. In this Write Stuff contribu… Guest Author Products Databases Pricing Add-Ons Datacenters Enterprise Learn Why Compose Articles Write Stuff Customer Stories Webinars Company About Privacy Policy Terms of Service Support Support Contact Us Documentation System Status Security © 2017 Compose, an IBM Company","With the release of Elasticsearch 5.x came Painless, Elasticsearch's answer to safe, secure, and performant scripting. We'll introduce you to Painless and show you what it can do.",How to Script Painless-ly in Elasticsearch,Live,133 340,"☰ * Login * Sign Up * Learning Paths * Courses * Our Courses * Partner Courses * Badges * Our Badges * BDU Badge Program * Events * Blog * Resources * Resources List * Downloads * BLOG Welcome to the Big Data University Blog .SUBCRIBE VIA FEED RSS - Posts RSS - Comments SUBSCRIBE VIA EMAIL Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address RECENT POSTS * This Week in Data Science (November 01, 2016) * This Week in Data Science (October 25, 2016) * This Week in Data Science (October 18, 2016) * How to run a successful Data Science meetup * This Week in Data Science (October 11, 2016) CONNECT ON FACEBOOK Connect on FacebookFOLLOW US ON TWITTER My TweetsTHIS WEEK IN DATA SCIENCE (NOVEMBER 01, 2016) Posted on November 3, 2016 by cora Here’s this week’s news in Data Science and Big Data. Don’t forget to subscribe if you find this useful! INTERESTING DATA SCIENCE ARTICLES AND NEWS * Democracy in the age of the Internet of Things – With the release of Swipe the Vote in spring 2016, Tinder, the ultimate hook-up app, broke new ground in the United States by claiming to be able to match young voters with their dream-perfect presidential candidate. * 5 Simple Math Problems No One Can Solve – Easy to understand, supremely difficult to prove. 
* Building an efficient neural language model over a billion words – New tools help researchers train state-of-the-art language models. * These are the 10 hottest data jobs – With recruitment in data on the rise and big businesses investing heavily in the data sector, Hays are looking at the top jobs in data. * Scholars use Big Data to show Marlowe co-wrote three Shakespeare plays – A new edition of William Shakespeare’s complete works will name Christopher Marlowe as co-author of three plays, shedding new light on the links between the two great playwrights after centuries of speculation and conspiracy theories. * Predicting the Presidential Election – With the presidential election less than a week out, Greg shares how he uses data to predict the results of the race. * What Happens When You Merge Virtual Reality with Big Data – Researchers at Cal Tech University are working on platforms that would allow scientists to use immersive virtual reality for multidimensional data visualization. * Pokemon Go Increased U.S. Activity Levels by 144 Billion Steps in Just 30 Days – The latest gaming craze increases activity levels for players, regardless of their age, sex, or weight. * Watch IBM Watson Suggest Treatments for a Cancer Patient – An IBM exec showed off a demo at Fortune’s inaugural Brainstorm Health conference. * Once Again: Prefer Confidence Intervals to Point Estimates – Today I saw a claim being made on Twitter that 17% of Jill Stein supporters in Louisiana are also David Duke supporters. For anyone familiar with US politics, this claim is a priori implausible, although certainly not impossible. * Data science and Big Data: Definitions and Common Myths – There are many ways to define what big data is, and this is why probably it still remains a really difficult concept to grasp. * Accelerated Computing and Deep Learning – This is truly an extraordinary time. In my three decades in the computer industry, none has held more potential, or been more fun. The era of AI has begun. * What to Know Before You Get In a Self-driving Car – Uber thinks its self-driving taxis could change the way millions of people get around. But autonomous vehicles aren’t any­where near to being ready for the roads. * Education’s Response to the Big Data Skills Demand – What are universities and colleges doing to make Big Data skills easier to obtain, and how are they speeding up the educational process to get these people into the workforce faster? UPCOMING DATA SCIENCE EVENTS * Introduction to Python for Data Science – Learn how to use Python for data science on November 10th. * IBM Event: Analytics Strategies in the Cloud – Join IBM and 2-time Canadian Olympic gold-medalist Alexandre Bilodeau on November 7th for a complimentary event in Montreal where you’ll network, eat, drink and engage in an inspiring discussion on making business analytics easier and more available for all departments throughout your company. SHARE THIS: * Facebook * Twitter * LinkedIn * Google * Pocket * Reddit * Email * Print * RELATED Tags: analytics , Big Data , data science , events , weekly roundup -------------------------------------------------------------------------------- COMMENTS LEAVE A REPLY CANCEL REPLY * About * Contact * Blog * Community * FAQ * Ambassador Program * Legal Follow us * * * * * * * Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! 
Email check failed, please try again Sorry, your blog cannot share posts by email.","Our thirty eighth release of a weekly round up of interesting Data Science and Big Data news, links, and upcoming events.","This Week in Data Science (November 01, 2016)",Live,134 349,"How do you back up a CouchDB or Cloudant database? One solution is to useCouchDB’s built-in replication API. Let’s say we have a Cloudant database called mydata that we need to back up.In CouchDB 1.x, backing up an entire database was as simple as locating thedatabase’s .couch file and copying it somewhere else. With its 2.x release, CouchDB and theCloudant database shard the data, splitting a single database into pieces anddistributing the data across multiple servers. So backing up a database is nolonger as simple as copying a single file.Then how do you back up? This blog post presents 3 options: * back up to a text file * replicate via the command-line * replicate via the Cloudant dashboardBACK UP TO A TEXT FILECloudant has a RESTful HTTP API, so it is easy to create your own tools tointeract with the service. I created a command-line tool called couchbackup , which you can use to spool an entire database (either CouchDB or Cloudant) toa text file.N.B. couchbackup does not do CouchDB replication, it simply pages throught the/_all_docs endpoint. Conflicts, deletions and revision history are discarded.Only the winning revisions (without the _rev) survive.To install the tool:You must have Node.js installed, together with its “npm” package manager. Then follow these steps: 1. Run: npm install -g couchbackup 2. Define an environment variable which holds the path of either: * your remote Cloudant database: export COUCH_URL=""https://myusername:mypassword@myhost.cloudant.com"" * or local CouchDB instance: export COUCH_URL=""http://localhost:5984"" 3. Back up individual databases to their own text files: couchbackup --db mydb mydb.txt 4. If you want to restore data from a backup into an empty database, then use the tool couchrestore which was also installed with couchbackup : cat mydb.txt | couchrestore --db mydb 5. To increase the speed of the restore operation you can perform multiple write operations in parallel: cat mydb.txt | couchrestore --db mydb --parallelism 5 REPLICATION VIA THE COMMAND-LINEAnother option is to replicate the database to another Cloudant account or to another CouchDB service byissuing an API call to set off a replication task that copies data from thesource database to the target database.Start replication by adding a document into the _replicator database; a document that lists the source and target database, includingauthentication credentials. You can achieve all of this from the command-lineusing a single curl command: export SOURCE=""https://myusername:mypassword@myhost.cloudant.com"" export TARGET=""https://myotherusername:myotherpassword@myotherhost.cloudant.com"" export JSON=""{\""source\"":\""$SOURCE/mydata\"",\""target\"":\""$TARGET/mydata\""}"" curl -X PUT -H ""Content-Type: application/json"" -d ""$JSON"" ""$SOURCE/_replicator""{""id"":""0b05156eefc1feca97e48cd6bd000380"",""_rev"":""1-a301b0fbfa8840f3ca936876729e37cc""} The API returns with a JSON object containing the id of a document, which youcan fetch to monitor the status of the replication job: curl ""$SOURCE/_replicator/0b05156eefc1feca97e48cd6bd000380""If you have Apache CouchDB installed locally and you intend to back up data froma Cloudant cluster, then instruct your local CouchDB installation to perform thereplication. 
Why your local machine? Because it has visibility to the Cloudantservice, but not vice-versa. export SOURCE=""https://myusername:mypassword@myhost.cloudant.com"" export TARGET=""https://localhost:5984"" export JSON=""{\""source\"":\""$SOURCE/mydata\"",\""target\"":\""$TARGET/mydata\""}"" curl -X PUT -H ""Content-Type: application/json"" -d ""$JSON"" ""$TARGET/_replicator""{""id"":""0b05156eefc1feca97e48cd6bd001976"",""_rev"":""1-ac15e7843682715ccb712fac41169cf5""} REPLICATION VIA THE CLOUDANT DASHBOARDYou can also start and monitor a replication using the web-based user interfaceof the Cloudant dashboard. 1. On the left, choose the Replication tab, 2. Click New Replication 3. Complete the form and click Replicate .You can monitor running replications from this screen.In the above example, we are replicating a database that lives in the currentuser’s Cloudant account (the My Databases tab in the Source Database section) to another Cloudant account (the Remote Database tab in the Target Database section). Use the same form to perform replicationsbetween all combinations of local and remote sources and targets.THE DIFFERENCE BETWEEN REPLICATION AND COUCHBACKUPCouchDB/Cloudant replication is a sophisticated sync protocol that ensures alldata from the source database is transferred to the target. If the targetdatabase already contains some documents, then clashing revisions are stored as document conflicts . In addition, deleted documents from the source database are also transferredto the target database.couchbackup simply iterates through the /db/_all_docs endpoint fetching the “winning revisions” no conflicting revisions are created.The result of a couchrestore operation is a collection of “first revisions” that matches the winningrevisions of the source database.BACK UP BEFORE TRYING CLOUDANT’S COUCHDB 2.0 SANDBOXNow that you have the tools you need to do backups, run one now before moving toCloudant’s new sandbox001 cluster. It’s a test cluster that’s running an alpha release of Apache CouchDB2.0. (Backups are important here, as all data will be deleted from the clusterat the end of the sandbox program!)Cloudant will soon run its clusters on the CouchDB 2.0 code base. It’s all partof a larger effort to realign Cloudant’s code base with that of the Apacheproject. For more information, read Stefan Kruger’s article, “Cloudant <3 Apache CouchDB™ 2.0″ , which includes details on accessing the sandbox cluster.LINKS * Cloudant Replication documentation * couchbackup© “Apache”, “CouchDB”, “Apache CouchDB”, and the CouchDB logo are trademarks orregistered trademarks of The Apache Software Foundation. All other brands andtrademarks are the property of their respective owners.SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Tagged: backup / cloudant / couchbackup / CouchDB / NoSQL / replication Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. 
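Since everything in the sandbox will eventually be wiped, it can be handy to script a backup of every database in the account before experimenting. A rough sketch, assuming the couchbackup CLI described above (in its write-to-stdout form) plus the jq JSON parser, neither of which is mandated by the original post:

# back up every non-system database reachable through COUCH_URL
export COUCH_URL="https://myusername:mypassword@myhost.cloudant.com"
for db in $(curl -s "$COUCH_URL/_all_dbs" | jq -r '.[]'); do
  case "$db" in
    _*) continue ;;                 # skip _replicator and other system databases
  esac
  echo "backing up $db"
  couchbackup --db "$db" > "$db.txt"
done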
Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM PrivacyIBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.","Following CouchDB's latest release, how do you back up a CouchDB or Cloudant database?",Simple CouchDB and Cloudant Backup,Live,135 354,"☰ * Login * Sign Up * Learning Paths * Courses * Our Courses * Partner Courses * Badges * Our Badges * BDU Badge Program * BLOG Welcome to the BDUBlog .SUBCRIBE VIA FEED RSS - Posts RSS - Comments SUBSCRIBE VIA EMAIL Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address RECENT POSTS * This Week in Data Science (February 7, 2017) * This Week in Data Science (January 31, 2017) * This Week in Data Science (January 24, 2017) * This Week in Data Science (January 17, 2017) * This Week in Data Science (January 10, 2017) CONNECT ON FACEBOOK Connect on FacebookFOLLOW US ON TWITTER My TweetsTHIS WEEK IN DATA SCIENCE (FEBRUARY 7, 2017) Posted on February 7, 2017 by Janice Darling Here’s this week’s news in Data Science and Big Data. Don’t forget to subscribe if you find this useful! INTERESTING DATA SCIENCE ARTICLES AND NEWS * IBM and United Airlines collaborate on enterprise iOS apps – United Airlines partners with IBM to develop iOS apps in an effort more efficient customer service. * Capturing IoT data from network’s edge to the cloud – Improving customer service through combining untapped IoT data and traditional consumer data. * Becoming a Data Scientist – The skills and tools needed to become an effective Data Scientist. * IBM’s Watson wants to help you do your taxes at H&R Block – IBM Watson partners with H&R Block to improve customer service and identify credits and deductions. * Essentials of working with Python cloud (Ubuntu) – A summary of functionalities that may assist in running Python scripts on the Ubuntu cloud. * First IBM France Sparkathon a winning success – Top Apache Spark enthusiasts participated in the first IBM Sparkathon aimed at improving banking customer services. * Now over 10,000 packages in R – The official R package repository has surpassed the 10,000 mark. * IBM calls healthcare industry a ‘leaky vessel in a stormy sea’ – How the healthcare industry is more at risk for cyberattacks. * A Computer Just Clobbered Four Pros At Poker – Program making use of A.I. algorithm defeats poker professionals. * Internet of Things Tutorial: IoT Devices and the Semantic Sensor Web – How IoT applications utilize multiple sensors and Internet connected devices. * The 5 deadly Data Management sins – 5 practices to avoid Data Management pitfalls. * Data Scientist – best job in America, again – Glassdoor has again ranked the Data Scientist position as the best job in USA. * Internet of Things: Setting business vision on speed and agility – The importance of an agile data platform in a competitive atmosphere. * Stream processing and the IBM Open Platform – Choosing the right engine for real-time data processing with Hadoop. * R Packages worth a look – A roundup of some interesting R packages. 
UPCOMING DATA SCIENCE EVENTS * IBM Event: Big Data and Analytics Summit – February 14, 2017 @ 7:15 am – 4:45 pm, Toronto Marriott Downtown Eaton Centre Hotel 525 Bay St. Toronto Ontario. COOL DATA SCIENCE VIDEOS * Deep Learning with Tensorflow – Applying Recurrent Networks to Language Modelling – Explanation of Applying Recurrent Networks to Language Modelling * Deep Learning with Tensorflow – Introduction to Unsupervised Learning – Overview of the basic concepts of Unsupervised Learning. * Deep Learning with Tensorflow – RBMs and Autoencoders – An overview of RBMs and Autoencoders. * SHARE THIS: * Facebook * Twitter * LinkedIn * Google * Pocket * Reddit * Email * Print * * RELATED Tags: analytics , Big Data , data science , events -------------------------------------------------------------------------------- COMMENTS LEAVE A REPLY CANCEL REPLY * About * Contact * Blog * Events * Ambassador Program * Resources * FAQ * Legal Follow us * * * * * * * Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.",Here’s this week’s news in Data Science and Big Data. ,"This Week in Data Science (February 7, 2017)",Live,136 355,"This video shows you how to execute some common HTTP API commands to create, read, update, and delete data in a Cloudant database. Sign up for a Cloudant account here: https://cloudant.com/sign-up/. Find more videos and tutorials in the Cloudant Learning Center: http://www.cloudant.com/learning-center","This video shows you how to execute some common HTTP API commands to create, read, update, and delete data in a Cloudant database. ",Execute Common HTTP API Commands,Live,137 362,"Learn R programming for data science * Home * About Us * Archives * Contribute * Free Account * We share R tutorials from scientists at academic and scientific institutions with a goal to give everyone in the world access to a free knowledge. Our tutorials cover different topics including statistics, data manipulation and visualization! Introduction Getting Data Data Management Visualizing Data Basic Statistics Regression Models Advanced Modeling Programming Best R Packages Tips & Tricks Data ManagementBEST PACKAGES FOR DATA MANIPULATION IN R by Fisseha Berhane on May 17, 2016 2 Commentsdplyr and data.table are amazing packages that make data manipulation in R fun. Both packages have their strengths. While dplyr is more elegant and resembles natural language, data.table is succinct and we can do a lot with data.table in just a single line. Further, data.table is, in some cases, faster (see benchmark here ) and it may be a go-to package when performance and memory are constraints. You can read comparison of dplyr and data.table from Stack Overflow and Quora . You can get reference manual and vignettes for data.table here and for dplyr here . You can read other tutorial about dplyr published at DataScience+ BACKGROUND I am a long time dplyr and data.table user for my data manipulation tasks. For someone who knows one of these packages, I thought it could help to show codes that perform the same tasks in both packages to help them quickly study the other. If you know either package and have interest to study the other, this post is for you. DPLYR dplyr has 5 verbs which make up the majority of the data manipulation tasks we perform. 
Select: used to select one or more columns; Filter: used to select some rows based on specific criteria; Arrange: used to sort data based on one or more columns in ascending or descending order; Mutate: used to add new columns to our data; Summarise: used to create chunks from our data. DATA.TABLE data.table has a very succinct general format: DT[ i, j, by ], which is interpreted as: Take DT, subset rows using i , then calculate j grouped by by . DATA MANIPULATION First we will install some packages for our project. library(dplyr) library(data.table) library(lubridate) library(jsonlite) library(tidyr) library(ggplot2) library(compare) The data we will use here is from DATA.GOV . It is Medicare Hospital Spending by Claim and it can be downloaded from here . Let’s download the data in JSON format using the fromJSON function from the jsonlite package. Since JSON is a very common data format used for asynchronous browser/server communication, it is good if you understand the lines of code below used to get the data. You can get an introductory tutorial on how to use the jsonlite package to work with JSON data here and here . However, if you want to focus only on the data.table and dplyr commands, you can safely just run the codes in the two cells below and ignore the details. spending=fromJSON(""https://data.medicare.gov/api/views/nrth-mfg3/rows.json?accessType=DOWNLOAD"") names(spending) ""meta"" ""data"" meta=spending$meta hospital_spending=data.frame(spending$data) colnames(hospital_spending)=make.names(meta$view$columns$name) hospital_spending=select(hospital_spending,-c(sid:meta)) glimpse(hospital_spending) Observations: 70598 Variables: $ Hospital.Name (fctr) SOUTHEAST ALABAMA MEDICAL CENT... $ Provider.Number. (fctr) 010001, 010001, 010001, 010001... $ State (fctr) AL, AL, AL, AL, AL, AL, AL, AL... $ Period (fctr) 1 to 3 days Prior to Index Hos... $ Claim.Type (fctr) Home Health Agency, Hospice, I... $ Avg.Spending.Per.Episode..Hospital. (fctr) 12, 1, 6, 160, 1, 6, 462, 0, 0... $ Avg.Spending.Per.Episode..State. (fctr) 14, 1, 6, 85, 2, 9, 492, 0, 0,... $ Avg.Spending.Per.Episode..Nation. (fctr) 13, 1, 5, 117, 2, 9, 532, 0, 0... $ Percent.of.Spending..Hospital. (fctr) 0.06, 0.01, 0.03, 0.84, 0.01, ... $ Percent.of.Spending..State. (fctr) 0.07, 0.01, 0.03, 0.46, 0.01, ... $ Percent.of.Spending..Nation. (fctr) 0.07, 0.00, 0.03, 0.58, 0.01, ... $ Measure.Start.Date (fctr) 2014-01-01T00:00:00, 2014-01-0... $ Measure.End.Date (fctr) 2014-12-31T00:00:00, 2014-12-3... As shown above, all columns are imported as factors and let’s change the columns that contain numeric values to numeric. cols = 6:11; # These are the columns to be changed to numeric. hospital_spending[,cols] <- lapply(hospital_spending[,cols],as.character) hospital_spending[,cols] <- lapply(hospital_spending[,cols], as.numeric) The last two columns are measure start date and measure end date. So, let’s use the lubridate package to correct the classes of these columns. cols = 12:13; # These are the columns to be changed to dates. hospital_spending[,cols] <- lapply(hospital_spending[,cols], ymd_hms) Now, let’s check if the columns have the classes we want. sapply(hospital_spending, class) $Hospital.Name ""factor"" $Provider.Number. ""factor"" $State ""factor"" $Period ""factor"" $Claim.Type ""factor"" $Avg.Spending.Per.Episode..Hospital. ""numeric"" $Avg.Spending.Per.Episode..State. ""numeric"" $Avg.Spending.Per.Episode..Nation. ""numeric"" $Percent.of.Spending..Hospital. ""numeric"" $Percent.of.Spending..State. 
""numeric"" $Percent.of.Spending..Nation. ""numeric"" $Measure.Start.Date ""POSIXct"" ""POSIXt"" $Measure.End.Date ""POSIXct"" ""POSIXt"" CREATE DATA TABLE We can create a data.table using the data.table() function. hospital_spending_DT = data.table(hospital_spending) class(hospital_spending_DT) ""data.table"" ""data.frame"" SELECT CERTAIN COLUMNS OF DATA To select columns, we use the verb select in dplyr . In data.table , on the other hand, we can specify the column names. SELECTING ONE VARIABLE Let’s selet the “Hospital Name” variable from_dplyr = select(hospital_spending, Hospital.Name) from_data_table = hospital_spending_DT[,.(Hospital.Name)] Now, let’s compare if the results from dplyr and data.table are the same. compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE dropped attributes REMOVING ONE VARIABLE from_dplyr = select(hospital_spending, -Hospital.Name) from_data_table = hospital_spending_DT[,!c(""Hospital.Name""),with=FALSE] compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE dropped attributes we can also use := function which modifies the input data.table by reference. We will use the copy() function, which deep copies the input object and therefore any subsequent update by reference operations performed on the copied object will not affect the original object. DT=copy(hospital_spending_DT) DT=DT[,Hospital.Name:=NULL] ""Hospital.Name""%in%names(DT)FALSE We can also remove many variables at once similarly: DT=copy(hospital_spending_DT) DT=DT[,c(""Hospital.Name"",""State"",""Measure.Start.Date"",""Measure.End.Date""):=NULL] c(""Hospital.Name"",""State"",""Measure.Start.Date"",""Measure.End.Date"")%in%names(DT) FALSE FALSE FALSE FALSE SELECTING MULTIPLE VARIABLES Let’s select the variables: Hospital.Name,State,Measure.Start.Date,and Measure.End.Date. from_dplyr = select(hospital_spending, Hospital.Name,State,Measure.Start.Date,Measure.End.Date) from_data_table = hospital_spending_DT[,.(Hospital.Name,State,Measure.Start.Date,Measure.End.Date)] compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE dropped attributes DROPPING MULTIPLE VARIABLES Now, let’s remove the variables Hospital.Name,State,Measure.Start.Date,and Measure.End.Date from the original data frame hospital_spending and the data.table hospital_spending_DT. from_dplyr = select(hospital_spending, -c(Hospital.Name,State,Measure.Start.Date,Measure.End.Date)) from_data_table = hospital_spending_DT[,!c(""Hospital.Name"",""State"",""Measure.Start.Date"",""Measure.End.Date""),with=FALSE] compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE dropped attributes dplyr has functions contains() , starts_with() and, ends_with() which we can use with the verb select. In data.table , we can use regular expressions. Let’s select columns that contain the word Date to demonstrate by example. 
from_dplyr = select(hospital_spending,contains(""Date"")) from_data_table = subset(hospital_spending_DT,select=grep(""Date"",names(hospital_spending_DT))) compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE dropped attributes names(from_dplyr) ""Measure.Start.Date"" ""Measure.End.Date"" RENAME COLUMNS setnames(hospital_spending_DT,c(""Hospital.Name"", ""Measure.Start.Date"",""Measure.End.Date""), c(""Hospital"",""Start_Date"",""End_Date"")) names(hospital_spending_DT) ""Hospital"" ""Provider.Number."" ""State"" ""Period"" ""Claim.Type"" ""Avg.Spending.Per.Episode..Hospital."" ""Avg.Spending.Per.Episode..State."" ""Avg.Spending.Per.Episode..Nation."" ""Percent.of.Spending..Hospital."" ""Percent.of.Spending..State."" ""Percent.of.Spending..Nation."" ""Start_Date"" ""End_Date"" hospital_spending = rename(hospital_spending,Hospital= Hospital.Name, Start_Date=Measure.Start.Date,End_Date=Measure.End.Date) compare(hospital_spending,hospital_spending_DT, allowAll=TRUE) TRUE dropped attributes FILTERING DATA TO SELECT CERTAIN ROWS To filter data to select specific rows, we use the verb filter from dplyr with logical statements that could include regular expressions. In data.table , we need the logical statements only. FILTER BASED ON ONE VARIABLE from_dplyr = filter(hospital_spending,State=='CA') # selecting rows for California from_data_table = hospital_spending_DT[State=='CA'] compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE dropped attributes FILTER BASED ON MULTIPLE VARIABLES from_dplyr = filter(hospital_spending,State=='CA' & Claim.Type!=""Hospice"") from_data_table = hospital_spending_DT[State=='CA' & Claim.Type!=""Hospice""] compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE dropped attributes from_dplyr = filter(hospital_spending,State %in% c('CA','MA',""TX"")) from_data_table = hospital_spending_DT[State %in% c('CA','MA',""TX"")] unique(from_dplyr$State) CA MA TX compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE dropped attributes ORDER DATA We use the verb arrange in dplyr to order the rows of data. We can order the rows by one or more variables. If we want descending, we have to use desc() as shown in the examples.The examples are self-explanatory on how to sort in ascending and descending order. Let’s sort using one variable. ASCENDING from_dplyr = arrange(hospital_spending, State) from_data_table = setorder(hospital_spending_DT, State) compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE dropped attributes DESCENDING from_dplyr = arrange(hospital_spending, desc(State)) from_data_table = setorder(hospital_spending_DT, -State) compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE dropped attributes SORTING WITH MULTIPLE VARIABLES Let’s sort with State in ascending order and End_Date in descending order. from_dplyr = arrange(hospital_spending, State,desc(End_Date)) from_data_table = setorder(hospital_spending_DT, State,-End_Date) compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE dropped attributes ADDING/UPDATING COLUMN(S) In dplyr we use the function mutate() to add columns. In data.table , we can Add/update a column by reference using := in one line. from_dplyr = mutate(hospital_spending, diff=Avg.Spending.Per.Episode..State. - Avg.Spending.Per.Episode..Nation.) from_data_table = copy(hospital_spending_DT) from_data_table = from_data_table[,diff := Avg.Spending.Per.Episode..State. - Avg.Spending.Per.Episode..Nation.] 
compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE sorted renamed rows dropped row names dropped attributes from_dplyr = mutate(hospital_spending, diff1=Avg.Spending.Per.Episode..State. - Avg.Spending.Per.Episode..Nation.,diff2=End_Date-Start_Date) from_data_table = copy(hospital_spending_DT) from_data_table = from_data_table[,c(""diff1"",""diff2"") := list(Avg.Spending.Per.Episode..State. - Avg.Spending.Per.Episode..Nation.,diff2=End_Date-Start_Date)] compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE dropped attributes SUMMARIZING COLUMNS We can use the summarize() function from dplyr to create summary statistics. summarize(hospital_spending,mean=mean(Avg.Spending.Per.Episode..Nation.)) mean 1820.409 hospital_spending_DT[,.(mean=mean(Avg.Spending.Per.Episode..Nation.))] mean 1820.409 summarize(hospital_spending,mean=mean(Avg.Spending.Per.Episode..Nation.), maximum=max(Avg.Spending.Per.Episode..Nation.), minimum=min(Avg.Spending.Per.Episode..Nation.), median=median(Avg.Spending.Per.Episode..Nation.)) mean maximum minimum median 1820.409 20025 0 109 hospital_spending_DT[,.(mean=mean(Avg.Spending.Per.Episode..Nation.), maximum=max(Avg.Spending.Per.Episode..Nation.), minimum=min(Avg.Spending.Per.Episode..Nation.), median=median(Avg.Spending.Per.Episode..Nation.))] mean maximum minimum median 1820.409 20025 0 109 We can calculate our summary statistics for some chunks separately. We use the function group_by() in dplyr and in data.table , we simply provide by . head(hospital_spending_DT[,.(mean=mean(Avg.Spending.Per.Episode..Hospital.)),by=.(Hospital)]) mygroup= group_by(hospital_spending,Hospital) from_dplyr = summarize(mygroup,mean=mean(Avg.Spending.Per.Episode..Hospital.)) from_data_table=hospital_spending_DT[,.(mean=mean(Avg.Spending.Per.Episode..Hospital.)), by=.(Hospital)] compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE sorted renamed rows dropped row names dropped attributes We can also provide more than one grouping condition. head(hospital_spending_DT[,.(mean=mean(Avg.Spending.Per.Episode..Hospital.)), by=.(Hospital,State)]) mygroup= group_by(hospital_spending,Hospital,State) from_dplyr = summarize(mygroup,mean=mean(Avg.Spending.Per.Episode..Hospital.)) from_data_table=hospital_spending_DT[,.(mean=mean(Avg.Spending.Per.Episode..Hospital.)), by=.(Hospital,State)] compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE sorted renamed rows dropped row names dropped attributes CHAINING With both dplyr and data.table , we can chain functions in succession. In dplyr , we use pipes from the magrittr package with %>% which is really cool. %>% takes the output from one function and feeds it to the first argument of the next function. In data.table , we can use %>% or [ for chaining. 
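As a short aside before the chained examples, the grouping and summarising ideas combine naturally, so several statistics can be computed per group at once; a sketch using the same columns as above:

# dplyr: several summary statistics per State
mygroup = group_by(hospital_spending, State)
summarize(mygroup,
          mean = mean(Avg.Spending.Per.Episode..Hospital.),
          maximum = max(Avg.Spending.Per.Episode..Hospital.),
          n = n())

# data.table: the same result in one line
hospital_spending_DT[, .(mean = mean(Avg.Spending.Per.Episode..Hospital.),
                         maximum = max(Avg.Spending.Per.Episode..Hospital.),
                         n = .N), by = .(State)]

The article's own chained examples follow.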
from_dplyr=hospital_spending%>%group_by(Hospital,State)%>%summarize(mean=mean(Avg.Spending.Per.Episode..Hospital.)) from_data_table=hospital_spending_DT[,.(mean=mean(Avg.Spending.Per.Episode..Hospital.)), by=.(Hospital,State)] compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE sorted renamed rows dropped row names dropped attributes hospital_spending%>%group_by(State)%>%summarize(mean=mean(Avg.Spending.Per.Episode..Hospital.))%>% arrange(desc(mean))%>%head(10)%>% mutate(State = factor(State,levels = State[order(mean,decreasing =TRUE)]))%>% ggplot(aes(x=State,y=mean))+geom_bar(stat='identity',color='darkred',fill='skyblue')+ xlab("""")+ggtitle('Average Spending Per Episode by State')+ ylab('Average')+ coord_cartesian(ylim = c(3800, 4000)) hospital_spending_DT[,.(mean=mean(Avg.Spending.Per.Episode..Hospital.)), by=.(State)][order(-mean)][1:10]%>% mutate(State = factor(State,levels = State[order(mean,decreasing =TRUE)]))%>% ggplot(aes(x=State,y=mean))+geom_bar(stat='identity',color='darkred',fill='skyblue')+ xlab("""")+ggtitle('Average Spending Per Episode by State')+ ylab('Average')+ coord_cartesian(ylim = c(3800, 4000)) SUMMARY In this blog post, we saw how we can perform the same tasks using data.table and dplyr packages. Both packages have their strengths. While dplyr is more elegant and resembles natural language, data.table is succinct and we can do a lot with data.table in just a single line. Further, data.table is, in some cases, faster and it may be a go-to package when performance and memory are the constraints. You can get the code for this blog post at my GitHub account. This is enough for this post. If you have any questions or feedback, feel free to leave a comment. Tags Best R Packages Data Manipulation dplyr The Author Fisseha is a writer for DataScience+, a data scientist at Aurotech and works for the FDA. He enjoys challenging and complex data analysis, data mining, machine learning and data visualization tasks. Fisseha holds a PhD in atmospheric Physics. LinkedIn WebsiteDISCLOSURE * Fisseha Berhane does not work or receive funding from any company or organization that would benefit from this article. 0 Shares Like this article? Give it a share: Facebook Twitter Google+ Linkedin Email this * Andrej OskinThank you for this interesting article, but one thing is wrong. These lines will mess original data: “` cols = 6:11; # These are the columns to be changed to numeric. hospital_spending[,cols] <- lapply(hospital_spending[,cols], as.numeric) “` It is discussed for example in this SO question: http://stackoverflow.com/questions/3418128/how-to-convert-a-factor-to-an-integer-numeric-without-a-loss-of-information You can not apply ""as.numeric"" to factor, because you will get labels instead of levels. I think it would be better to define options(stringAsFactors = F) somewhere at the beginning of your script, or in .Rprofile * Fisseha BerhaneThank you. 
Changed it to character first by adding hospital_spending[,cols] <- lapply(hospital_spending[,cols], as.character). Also updated the results that were affected.","dplyr and data.table are amazing packages that make data manipulation in R fun. Both packages have their strengths. While dplyr is more elegant and resembles natural language, data.table is succinct",Best packages for data manipulation in R,Live,138 364,"DESIGNING THE UFC MONEYBALL Published Jan 18, 2017 Using big data analysis on sports? Gigi Sayfan takes us through doing just that with Cassandra/Scylla, MySQL, and Redis. In this Compose Write Stuff article, he shows us how he constructs his own Moneyball. Sports and big data analysis are a great match. In any sporting event, so much is happening at every moment. There are many trends that evolve over various time scales - momentum within the same match, within a season, and over the career of an athlete. You can collect a lot of data, analyze it, and use it for many purposes. The movie Moneyball (based on a book by Michael Lewis) made this notion popular. In this article, we will take this concept to the world of MMA (mixed martial arts) and design a data collection and analytics platform for it. The main focus will be on data collection, storage, and access. The design of the actual analytics will be left as the dreaded exercise to the reader! The weapons of choice will be Cassandra/Scylla, MySQL, and Redis. UFC MONEYBALL Before we jump ahead and start talking databases, let's understand the domain a little and what we want to accomplish. I always find this a critical first step that provides structure and a framework to operate within. The use cases and conceptual model usually stabilize pretty quickly and provide a lot of clarity. Additional use cases and concepts are often added later as ""yet another"" and fit into the existing framework. QUICK INTRO TO MMA MMA is a sport where two competitors with backgrounds in multiple martial arts fight each other in a cage using versatile techniques that involve striking (punching and kicking) and grappling (throws, takedowns, joint locks, and chokes). A fighter wins if his opponent submits, is knocked out, or is unable to defend himself intelligently as determined by a referee present in the cage with them. The UFC is the most popular and successful organization, and there is a lot of money involved. USE CASES When you start looking at MMA from a data analytics point of view, many use cases come to mind.
Here is an arbitrary list: * Understand the style of a particular fighter * Find effective attacks against a particular fighter * Find effective defense against the attack of a particular fighter * Understand the energy level of a fighter along a fight * Understand how a fighter changes his style in the presence of an injury * Understand the game plan of a fighter against a particular fighter or type of fighter * Find tactics that can throw a fighter off his game plan * Adjust game plan during a round or between rounds For example, the fighter Demian Maia is an elite Jiu Jitsu practitioner. He is famous for his effective ground game where he drags opponents to the floor, mounts them or takes their back and chokes them. Maia is at one end of the scale because his game plan is very simple and everybody knows what he's going to do, but he is so good at it that he is very difficult to stop. The fighter Yair Rodriguez is at the other end of the scale exhibiting an extravagant style full of jumping, spinning kicks and somersaults mixed with surprise take-downs. It doesn't appear that he himself knows what he's going to do from one second to the next. CONCEPTUAL MODEL A basic conceptual model for this domain may include the following entities: Fighter, Match, and Event. The Fighter entity has a lot of data: physical attributes, age, weight class, fighting stance, stamina, ranks in various martial arts disciplines, match history, injury history, favorite techniques, etc. The Fight entity has a lot of data, too: venue, opponents, referee, number of rounds, and a collection of events. The FightEvent entity represents anything relevant that happens during a match: fighter A advances, takedown/throw attempt, jab thrown, jab lands, uppercut thrown, knock down, front kick to the face, eye poke (illegal), stance switch, guard pass to side control, arm-bar attempt, etc. The EventCategory classifies events to various categories. For example, movement, punch, kick, judo throw, position change on the ground, submission. The interesting aspect of fight events is that they represent a time-series and the order and timing of event sequences contain a lot of information that help our use cases. Note that the UFC organizes and promotes events that contain multiple matches in one night. The events we consider here are fight events happening during the match. MMA ANALYTICS Machine learning or more traditional statistical analysis and visualization can take all the data about a fighter and their opponent, both historically and in real-time during a match. It can provide a lot of insights that will help the well-informed fighter and their coaches prepare the perfect game plan for a particular opponent in a particular match and adjust it intelligently based on how the match evolves. CASSANDRA, MYSQL AND REDIS The UFC moneyball data is diverse and will be used in different ways. Storing it all in one database is not ideal. In this section, I'll describe briefly the databases we will use in our design. CASSANDRA/SCYLLA The open source Apache Cassandra is a great database for time-series data. It was designed for distributed, large-scale workloads. It is fast, stable and battle-tested. It is a decentralized, highly available and has no single point of failure. Cassandra is also idempotent and provides an interesting mix of consistency levels on a query by query level. Cassandra succeeds in doing all that by a careful selection of its feature set and even more careful selection of the features it doesn't implement. 
For example, efficient ad-hoc queries are not supported. With Cassandra you better know the shape of your queries when you model your data and design your schema. Scylla a high-performance drop-in replacement for Cassandra, which is already plenty fast. It claims to have 10X better throughput and super low latency. Conveniently Compose provides Hosted Scylla (in beta right now). This is cool because you get to benefit from the extensive Cassandra documentation, experience, tooling and community and yet run a streamlined and highly optimized Scylla engine. MYSQL MySQL needs no introduction. I'll just mention that Compose now has a hosted MySQL service in beta. If you prefer PostgreSQL, which is also hosted by Compose, or any other relational database that's fine. I will not be using any MySQL-specific capabilities here and the concepts transfer. REDIS Redis is top of the class when it comes to fast in-memory key-value stores. But, it is much more than that and defines itself as a data structure server. We'll see this capability in action later. Of course, Compose can host Redis for you. A HYBRID POLY-STORE STORAGE SCHEME In this section, we'll model our domain and conceptual model. The basic idea is to utilize the strength of each store and divide each type of data or metadata into the most appropriate data store. Then the application can combine data from multiple stores. STORING FIGHT EVENTS IN CASSANDRA Cassandra is a columnar database. This means that column data is stored sequentially in memory (and on disk). But, unlike relational databases, you can query arbitrary data in a single query. Cassandra organizes the data in wide rows. Each such wide row has a key and can contain a lot of data (e.g. 100MB) and you can query a single wide row at the time. If you try to think of it in relational terms, then a wide row is the analog of a SQL table in a DB that doesn't support joins. This can get really confusing because CQL (Cassandra Query Language) is very similar syntactically to SQL, but the same terms mean different things. For example, a Cassandra table is made of multiple wide rows. Each row in the table shares the same schema, but since you can query on a single row at a time it is better to think of a Cassandra table as a collection of SQL tables with similar schema in a sharded relational database. This is pretty accurate because different wide rows may be split across machines. Another limitation of Cassandra's design is that you can efficiently query only consecutive data from a single wide row. That means that it is very important to design your schema in a way that matches your queries. If you need to query data in different orders, Cassandra says disks are cheap and you just need to store the data multiple times in different orders (a.k.a materialized views). Let's see how all this affects our modeling of fight events. We're interested in querying fight events at the match level and then at the round level. This way we can analyze the meaningful time series. We may be interested also in doing longitudinal studies on a particular fighter, how they evolved over their career, what are their strengths and weaknesses etc. 
Here is a Cassandra table schema that addresses these concerns: CREATE KEYSPACE fightdb WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 }; use fightdb; DROP TABLE fight_events; CREATE TABLE fight_events ( fight_id int, round int, ts int, fighter_id int, event_id int, PRIMARY KEY (fight_id, round, ts) ) WITH CLUSTERING ORDER BY (round ASC, ts ASC); Let's break it down. The first line creates a keyspace called fightdb , which is like a separate DB with its own policies. Normally, replication factor will be at least 3 to gain redundancy. Then we tell Cassandra to use that it, so there is no need to qualify names with the DB name. Next, we drop the fight_events table in case we're re-creating the DB from scratch. Don't do this in production because you'll destroy all your data. You can ALTER TABLE to modify the schema. Finally, we get to create the table fight_Events . It looks like regular SQL. The columns are defined using Cassandra data types. The primary key is where things get interesting. The primary key is composed of a partition key and a clustering key. The partition key is fight_id and it defines the wide row. Every entry with the same fight id will go into the same wide row. The clustering key is round and ts . The ts column represents seconds into the current 5 minutes round (values will be 0 through 299). Inside the wide row, each record is called a compound column, which is the analog of a SQL row. Then we have the clustering order, which says that the order will be by round first and ts second both ascending. It looks pretty harmless so far. Let's insert some data. Inserts look just like SQL inserts, but you have to provide the primary key. No auto incrementing ID (which you want to avoid in a distributed system anyway) INSERT INTO fight_events (fight_id, round, ts, fighter_id, event_id) VALUES (1, 1, 10, 2, 1); INSERT INTO fight_events (fight_id, round, ts, fighter_id, event_id) VALUES (1, 1, 11, 2, 4); INSERT INTO fight_events (fight_id, round, ts, fighter_id, event_id) VALUES (1, 1, 12, 1, 3); INSERT INTO fight_events (fight_id, round, ts, fighter_id, event_id) VALUES (1, 2, 7, 1, 4); INSERT INTO fight_events (fight_id, round, ts, fighter_id, event_id) VALUES (1, 2, 8, 2, 1); INSERT INTO fight_events (fight_id, round, ts, fighter_id, event_id) VALUES (2, 1, 3, 2, 2); INSERT INTO fight_events (fight_id, round, ts, fighter_id, event_id) VALUES (2, 1, 4, 2, 1); Cassandra will also overwrite records with the same primary key. No uniques and no duplicates in Cassandra. The reason is that Cassandra is idempotent. You can perform the same operation multiple times with modifying the state. So, insert is also update in Cassandra. OK, let's run some queries. This is where things get interesting. Starting with selecting all records: select * from fight_events; fight_id | round | ts | event_id | fighter_id ----------+-------+----+----------+------------ 1 | 1 | 10 | 1 | 2 1 | 1 | 11 | 4 | 2 1 | 1 | 12 | 3 | 1 1 | 2 | 7 | 4 | 1 1 | 2 | 8 | 1 | 2 2 | 1 | 3 | 2 | 2 2 | 1 | 4 | 1 | 2 (7 rows) So far, so good. Note that the timestamp seems different. The order is indeed by round and ts . Let's verify that by inserting a record with the same primary key replacing the existing one. INSERT INTO fight_events (fight_id, round, ts, fighter_id, event_id) VALUES (1, 1, 10, 2, 5); We replaced the first record. 
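The same upsert behavior holds when the inserts come from application code instead of cqlsh. Here is a minimal sketch using the DataStax Python driver; it is only an illustration, not part of the original design. The contact point is made up, and the keyspace is the fightdb keyspace defined above.

from cassandra.cluster import Cluster

# Assumed contact point; substitute your own Scylla/Cassandra connection details.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('fightdb')

insert = session.prepare(
    'INSERT INTO fight_events (fight_id, round, ts, fighter_id, event_id) '
    'VALUES (?, ?, ?, ?, ?)'
)

# Writing the same primary key (fight_id=1, round=1, ts=10) twice does not
# create a duplicate row; the second write simply overwrites the first.
session.execute(insert, (1, 1, 10, 2, 1))
session.execute(insert, (1, 1, 10, 2, 5))

count = session.execute(
    'SELECT count(*) FROM fight_events WHERE fight_id = 1 AND round = 1 AND ts = 10'
).one()[0]
print(count)  # prints 1 -- still a single row, now carrying event_id = 5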
Selecting all the records shows the following, and the order remains the same (the first record now carries event_id 5):

select * from fight_events;

 fight_id | round | ts | event_id | fighter_id
----------+-------+----+----------+------------
 1 | 1 | 10 | 5 | 2
 1 | 1 | 11 | 4 | 2
 1 | 1 | 12 | 3 | 1
 1 | 2 | 7 | 4 | 1
 1 | 2 | 8 | 1 | 2
 2 | 1 | 3 | 2 | 2
 2 | 1 | 4 | 1 | 2

(7 rows)

Let's try getting just the events with an id greater than 3:

select * from fight_events where event_id > 3;

InvalidRequest: Error from server: code=2200 [Invalid query] message=""Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING""

We can't do that. The ALLOW FILTERING option is a table scan, so not much help there. Maybe we can at least select events with a particular event_id (e.g. 4):

select * from fight_events where event_id = 4;

InvalidRequest: Error from server: code=2200 [Invalid query] message=""Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING""

We get exactly the same result. You must specify at least the partition key. Maybe we're asking too much. The event_id is not part of the primary key, so it's understandable why you can't efficiently query by it. Let's go for something simpler and just get all the records from round 1:

select * from fight_events where round = 1;

InvalidRequest: Error from server: code=2200 [Invalid query] message=""Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING""

You can't do that either. Cassandra has the IN operator that allows you to provide multiple keys in one query, but you will have to list each and every partition key:

select * from fight_events where fight_id IN (1, 2) AND round = 1;

 fight_id | round | ts | event_id | fighter_id
----------+-------+----+----------+------------
 1 | 1 | 10 | 5 | 2
 1 | 1 | 11 | 4 | 2
 1 | 1 | 12 | 3 | 1
 2 | 1 | 3 | 2 | 2
 2 | 1 | 4 | 1 | 2

(5 rows)

You must use equality tests on all the components of your where clause (which should only use elements from your clustering key, left to right) except the last component, where you can use inequalities or ranges. For example, to get events that occurred after the first 5 seconds of round 1:

select * from fight_events where fight_id IN (1, 2) AND round = 1 AND ts > 5;

 fight_id | round | ts | event_id | fighter_id
----------+-------+----+----------+------------
 1 | 1 | 10 | 5 | 2
 1 | 1 | 11 | 4 | 2
 1 | 1 | 12 | 3 | 1

Here are a couple of other queries that are invalid in CQL:

select * from fight_events where fight_id IN (1, 2) AND ts > 5;

InvalidRequest: Error from server: code=2200 [Invalid query] message=""PRIMARY KEY column ""ts"" cannot be restricted as preceding column ""round"" is not restricted""

select * from fight_events where fight_id IN (1, 2) AND round < 3 AND ts > 5;

InvalidRequest: Error from server: code=2200 [Invalid query] message=""Clustering column ""ts"" cannot be restricted (preceding column ""round"" is restricted by a non-EQ relation)""

What about indices? Cassandra supports secondary indexes, but they come with so many restrictions and caveats that the consensus is that you should rarely use them.
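For completeness, the key-restricted query that Cassandra does accept translates directly into application code. This is a rough sketch with the DataStax Python driver, reusing the session from the earlier snippet; it is an illustration, not part of the original article. It loops over the partition keys instead of using IN, which keeps each request pinned to a single wide row.

# Equality on the partition key and the leading clustering column, and a
# range only on the last clustering column -- the shape CQL allows.
after_5s = session.prepare(
    'SELECT round, ts, fighter_id, event_id FROM fight_events '
    'WHERE fight_id = ? AND round = ? AND ts > ?'
)

for fight_id in (1, 2):
    for row in session.execute(after_5s, (fight_id, 1, 5)):
        print(fight_id, row.round, row.ts, row.fighter_id, row.event_id)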
Check out this article for all the nitty-gritty details about secondary indices: Cassandra Native Secondary Index Deep Dive . The official position of Cassandra is that disks are cheap and if you want to access your data in different ways, you should simply duplicate the data. In general, for every query type you should have a dedicated table where they data is already organized sequentially such that you can pull out the answer as is. For example, if we want to pull all the data by rounds then our partition key should be round . Consider this primary key for the same fight_events table: PRIMARY KEY (round, fight_id, ts) WITH CLUSTERING ORDER BY (fight_id ASC, ts ASC); We merely switched fight_id and round , but this changes everything (and not for the better). Remember that the partition key defines a wide row that must be fully present on the same node. Since there are only 3 rounds (ignoring championship fights that last 5 rounds) then we can't distribute our data on more than 3 nodes. This is ridiculous of course. Given enough time and data a single machine won't even be able to hold all the fight events that occurred in round 1. The solution is a compound partition key. For example, we can add the month to the partition key (the parentheses group tells Cassandra that round and month are a compound partition key): PRIMARY KEY ((round, month), fight_id, round, ts) ) WITH CLUSTERING ORDER BY (round ASC, ts ASC); Now, each wide row will contain all the events that occurred in a round in a particular month. This is better for data distribution and you don't have to worry about the data in a single wide row growing over time beyond the capacity of a single machine. But, now if you want to query all the events in round 1 over the entire year of 2016, you'll need to run 12 queries for each combination of round 1 and a month: select * from fight_events where round = 1 and month = 1; select * from fight_events where round = 1 and month = 2; select * from fight_events where round = 1 and month = 3; ... select * from fight_events where round = 1 and month = 12; In general, we prefer to avoid data duplication. Cassandra's assertion that disks are cheap doesn't hold up for web-scale systems. In a previous company, I ran a Cassandra cluster that accumulated half a billion events per day. Over more 3 years that system collected many terabytes of data. Various analytics jobs required fast access to the entire dataset. The thing with Cassandra is that even if disk space is relatively cheap, network traffic isn't. Cassandra replicates data as part of its robust design. Cassandra is also constantly compacting and re-shuffling data across the cluster. The more data you have the more you pay for maintenance operations that might even be invisible to you. That's the reason to store as little as possible in Cassandra. You'll note also that the schema contains just integer ids. Where is the actual data? Again, the idea is to save storage. Why store repeatedly the same values? Even with Cassandra's compression, there is a price to pay (mostly in big result sets). This is especially true if you need to update some value stored ubiquitously across the cluster. Enter the relational DB. STORING METADATA IN MYSQL The idea is that all these ids like fight_id , event_id and fighter_id are identifiers of rows in a corresponding relational metadata DB. 
Let's look at a simple schema: CREATE TABLE fight_event ( id INTEGER, name VARCHAR(255), PRIMARY KEY(id) ) ENGINE=INNODB; CREATE TABLE fighter ( id INTEGER, name VARCHAR(255), age INTEGER, weight INTEGER, PRIMARY KEY(id) ) ENGINE=INNODB; CREATE INDEX fighter_age ON fighter(age); CREATE INDEX fighter_weight ON fighter(weight); CREATE TABLE fight ( id INTEGER, fighter1_id INTEGER, fighter2_id INTEGER, title VARCHAR(255), PRIMARY KEY(id), FOREIGN KEY (fighter1_id) REFERENCES fighter(id), FOREIGN KEY (fighter2_id) REFERENCES fighter(id) ) ENGINE=INNODB; CREATE INDEX fight_fighter1 ON fight(fighter1_id); CREATE INDEX fight_fighter2 ON fight(fighter2_id); MySQL can manage the metadata that will be indexed extensively. The metadata is very read-heavy. A lot of indices to update on insert don't present a problem. But, you can slice and dice it very efficiently to arrive at important ids that are stored in Cassandra. The hybrid query pattern is that you query MySQL using convoluted ad-hoc query to your heart's content. You end up with fight ids and fighter ids that you use to construct and filter Cassandra queries and then when you get back from Cassandra a result set with a bunch event ids, you can look them up in the fight_event table or more likely in a in-memory dictionary you loaded at the beginning of your program. STORING BLAZING HOT DATA IN REDIS It sounds like we're all set with the hybrid Cassandra + MySQL hybrid query system. But, sometimes it's not enough. Consider a live UFC championship fight, millions of viewers watching the fight via our custom app that adds live stats and displays various visualizations of real-time fight events and slow-motion. The typical web solution to deal with the massive demand of popular content is a CDN (content delivery network). CDNs are great, but they are mostly optimized for static, large content. Here we're talking about live streams of relatively small data. You may try to service each request dynamically as it comes from the hybrid Cassandra + MySQL, but the reality is that it is very difficult to try and fine-tune the caching behavior. Instead, we can use Redis. Redis is a super-fast, in-memory (yet can be durable), data structure server. That means it's a fancy key-value store that excels at retrieving data for its users. It can be distributed via a Redis Cluster, so you don't have to worry about being limited to a single machine. When there is a massive demand for a lot of data, Redis can be a great solution to improve the responsiveness of the system, as well as providing additional capacity quickly (à la elastic horizontal scaling). In comparison adding a new node to a Cassandra cluster is a long and tedious process. The replication will impact the entire cluster because Cassandra will try to evenly distribute the data between all nodes, even if you just want to add a node temporarily to handle a spike in requests. Redis can also be great for distributed locks and counters . Overall, Redis gives you a lot of options for high-performance flexible operations on data that is not suitable for either Cassandra or MySQL. Cassandra has distributed counters, but they suffer from high latency due to some design limitations . For example, let's say we want to keep track on the significant strikes (very important statistic) of every fighter in every fight this evening. A good way to model it in Redis is to use its HASH data structure. The HASH is a dictionary or a map. Let's create a HASH called significant_strikes . 
The HASH will map the pair fight_id:fighter_id to the number of significant strikes they delivered to the opponent. Note that in some tournaments the same fighter may participate in multiple fights. Here we initialize the significant_strikes HASH by setting two keys ( fight_id:fighter_id ) to 0. In this case, the fighters 44 and 55 fight each other in fight 123. HSET significant_strikes 123:44 0 HSET significant_strikes 123:55 0 Let's say 44 delivered a significant strike. We need to increment its counter: HINCRBY significant_strikes 123:44 1 Now, suppose 55 countered with a 3 strike combo (Wow!): HINCRBY significant_strikes 123:55 3 At each point you can get the entire significant_strikes HASH: HGETALL significant_strikes 1) ""123:44"" 2) ""1"" 3) ""123:55"" 4) ""3"" Or just specific keys: HGETALL significant_strikes 123:55 ""3"" ARCHITECTING THE UFC MONEYBALL Let's go big and think about the overall system architecture. The working assumption is that a large number of users will access the data concurrently. During a live event there will be peak demand for data related to the matches and the participating fighters. In addition, various jobs will run in the background and some long running machine learning processes will digest and crunch numbers constantly. There will be publicly facing REST APIs. Stateless API servers (e.g. nginx) will delegate queries and requests to internal services via fast protocols (e.g. grpc). The services will fetch data from all the stores, merge them, massage the data and return it to the users via the APIs. The users will consume the data via various clients: mobile, web, custom tools, etc. In addition to Cassandra, MySQL and Redis, the system may also use some cloud storage for AWS S3 for archiving cold data and for backups. The system will run on one of the public cloud providers: AWS, GCE or Azure. The stateless microservices will be deployed as Docker containers. The data stores will be deployed directly, and the containers will be orchestrated as a Kubernetes cluster. CONCLUSION Large-scale systems require multiple types of data stores to manage their data properly. When you deal with time-series data, Cassandra is a solid option. ScyllaDB is a promising high-performance drop-in replacement for Cassandra. But, Cassandra data modeling is not trivial and querying it efficiently can be assisted by storing metadata in a relational DB like MySQL. Redis is a great option for caching frequently used data in memory to offload pressure from Cassandra and MySQL. One of the most challenging aspects when designing a large-scale system that has to handle a lot of data is figuring out what kinds of data you need to handle, their cardinality, and the operations that you need to perform on each. Of course, very often you will not have a full grasp at the outset of your problem domain and even if you do, things will change. That means that you also have to build a flexible enough system that will allow you to move data between stores (and possibly add more data stores) as you learn more. Gigi Sayfan is the chief platform architect of VRVIU, a start-up developing cutting-edge hardware and software technology in the virtual reality space. 
Gigi has been developing software professionally for 21 years in domains as diverse as instant messaging, morphing, chip fabrication process control, embedded multi-media application for game consoles, brain-inspired machine learning, custom browser development, web services for 3D distributed game platform, IoT/sensors and most recently virtual reality.This article is licensed with CC-BY-NC-SA 4.0 by Compose. Image via Skitterphoto Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Customer Stories Compose Webinars Support System Status Support Documentation Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ ScyllaDB MySQL Enterprise Add-ons * Deployments AWS SoftLayer Google Cloud * Subscribe Join 75,000 other devs on our weekly database newsletter. © 2017 Compose","Gigi Sayfan takes us through doing just that with Cassandra/Scylla, MySQL, and Redis. In this Compose's Write Stuff article, he shows us how he constructs his own Moneyball.",Designing the UFC Moneyball,Live,139 366,"Compose The Compose logo Articles Sign in Free 30-day trialDATALAYER EXPOSED: EMILE BAIZEL & BUILDING A FINTECH BOT ON MONGODB AND ELASTICSEARCH Published Jul 24, 2017 datalayer DataLayer Exposed: Emile Baizel & Building a Fintech Bot on MongoDB and ElasticsearchIt's Monday morning, which means we'll kick this week off with another video from our DataLayer Conference earlier this year. This week, we're featuring Emile Baizel from Digit . Emile Baizel took the stage at DataLayer as our eighth speaker. Emile is a full-stack developer Digit where he works to make the Digit bot smarter. And that's where Emile's inspiration for his talk came from. Digit users engage with Digit through their bot who answers questions about their account like what's my checking balance and is my money safe and more humorous ones as tell me a joke . They tried a few different approaches before going with their current bot that learns from past user questions to help answer future ones. Emile talks about how they built their bot, why they chose MongoDB and Elasticsearch and how they use Node's event emitters. If you'd like to follow along with Emile's slide deck, you can download it here . Previous DataLayer 2017 talks: * Charity Majors' presentation on observability * Ross Kukulinski's presentation on the state of containers * Antonio Chavez's presentation on the why he left MongoDB * Jonas Helfer's presentation on Joins across databases with GraphQL * Joshua Drake's presentation on PostgreSQL as the center of your data universe * Lorna Jane Mitchell's presentation on surviving failure with RabbitMQ * Amy Unrah's presentation on Scaling out SQL Databases with Spanner Be sure to tell us what you think using hashtag #DataLayerConf and check back next Monday for the next talk at DataLayerConf. -------------------------------------------------------------------------------- We're in the planning stages for DataLayer 2018 right now so, if you have an idea for a talk, start fleshing that out. We'll have a CFP, followed by a blind submission review, and then select our speakers, who we'll fly to DataLayer to present. Sounds fun, right? Thom Crowe is a marketing and community guy at Compose, who enjoys long walks on the beach, reading, spending time with his wife and daughter and tinkering. Love this article? 
Head over to Thom Crowe ’s author page and keep reading.RELATED ARTICLES Jul 17, 2017DATALAYER EXPOSED: AMY UNRUH & SCALING OUT SQL DATABASES WITH SPANNER Let's start the week off with another video from DataLayer Conf, the Compose sponsored Conference held in Austin this past ma… Thom Crowe Jul 10, 2017DATALAYER EXPOSED: LORNA JANE MITCHELL & SURVIVING FAILURE WITH RABBITMQ It's Monday which means it's time for our next DataLayer Conf video installment. This week, we'll hear about surviving failur… Thom Crowe Jul 3, 2017DATALAYER EXPOSED: JOSHUA DRAKE & POSTGRESQL: THE CENTER OF YOUR DATA UNIVERSE Start your Monday on a high note and catch up on videos from this year's DataLayer Conference. This week we're highlighting J… Thom Crowe Products Databases Pricing Add-Ons Datacenters Enterprise Learn Why Compose Articles Write Stuff Customer Stories Webinars Company About Privacy Policy Terms of Service Support Support Contact Us Documentation System Status Security © 2017 Compose, an IBM Company","Another video from our DataLayer Conference, featuring Emile Baizel from Digit.",Emile Baizel & Building a Fintech Bot on MongoDB and Elasticsearch,Live,140 369,"REDDIT SENTIMENT ANALYSIS IN SPARKR AND COUCHDB Chetna Warade / August 15, 2016A few months ago, I published Sentiment Analysis of Reddit AMAs , which explained how to grab a reddit Ask Me Anything (AMA) conversation and export its data for analysis using our Simple Data Pipe app. From there, I used the Spark-Cloudant Connector, and Watson Tone Analyzer to get insights into writer sentiment. Recently, I followed up with a similar exercise, but this time performing analysis using dashDB data warehouse and R . I’m back at it again, to share my excitement over SparkR , an R API for Apache Spark. Analysis using SparkR lets you create a full working notebook really fast and iterate with ease. In this tutorial, I connect to our Cloudant database using a handy new CouchDB R package, fetch all json documents, create a SparkR dataframe, analyze with SQL and SparkR, then plot results with R. Here’s the flow: BEFORE YOU BEGIN If you haven’t already, read my earlier Sentiment Analysis of Reddit AMAs blog post , so you understand what we’re up to here. You’ll get the background you need, and we can dive right in to this alternate analysis approach. (You don’t need to follow that earlier tutorial, nor the follow-up on dashDB + R in order to implement this SparkR solution. All the steps you need are here in this blog post.) DEPLOY SIMPLE DATA PIPE The fastest way to deploy this app to Bluemix (IBM’s Cloud platform) is to click the Deploy to Bluemix button, which automatically provisions and binds the Cloudant service too. (Bluemix offers a free trial, which means you can try this tutorial out for free.) If you would rather deploy manually , or have any issues, refer to the readme . When deployment is done, click the EDIT CODE button. INSTALL REDDIT CONNECTOR Since we’re importing data from reddit, you need to establish a connection between reddit and Simple Data Pipe. Note: If you have a local copy of Simple Data Pipe, you can install this connector using Cloud Foundry . 1. In Bluemix, at the deployment succeeded screen, click the EDIT CODE button. 2. Click the package.json file to open it. 3. Edit the package.json file to add the following line to the dependencies list: ""simple-data-pipe-connector-reddit"": ""^0.1.2"" Tip: be sure to end the line above with a comma and follow proper JSON syntax. 4. From the menu, choose File Save . 5. 
Press the Deploy app button and wait for the app to deploy again. ADD SERVICES IN BLUEMIX To work its magic, the reddit connector needs help from a couple of additional services. In Bluemix, we’re going analyze our data using the Apache Spark and Watson Tone Analyzer services. So add them now by following these steps: PROVISION IBM ANALYTICS FOR APACHE SPARK SERVICE 1. On your Bluemix dashboard, click Work with Data . Click New Service . Find and click Apache Spark then click Choose Apache Spark Click Create . PROVISION WATSON TONE ANALYZER SERVICE 1. In Bluemix, go to the top menu, and click Catalog . 2. In the Search box, type Tone Analyzer , then click the Tone Analyzer tile. 3. Under app , click the arrow and choose your new Simple Data Pipe application. Doing so binds the service to your new app. 4. In Service name enter only tone analyzer (delete any extra characters) 5. Click Create . 6. If you’re prompted to restage your app, do so by clicking Restage . LOAD REDDIT DATA 1. Launch simple data pipe in one of the following ways: * If you just restaged, click the URL for your simple data pipe app. * Or, in Bluemix, go to the top menu and click Dashboard , then on your Simple Data Pipe app tile, click the Open URL button. 2. In Simple Data Pipe, go to menu on the left and click Create a New Pipe . 3. Click the Type dropdown list, and choose Reddit AMA .When you added a reddit connector earlier, you added the Reddit option you’re choosing now. 4. In Name , enter ibmama , or whatever you wish. 5. If you want, enter a Description . 6. Click Save and continue . 7. Enter the URL for the reddit conversation you want to analyze. You’re not limited to using an AMA conversation here. You can enter the URL of any reddit conversation, including the IBM-hosted AMA we used in earlier tutorials: https://www.reddit.com/r/IAmA/comments/3ilzey/were_a_bunch_of_developers_from_ibm_ask_us 8. Click Connect to AMA . You see a You’re connected confirmation message. 9. Click Save and continue . 10. On the Filter Data screen, make the following 2 choices: * under Comments to Load , select Top comments only . * under Output format , choose JSON flattened . Then click Save and continue . Why flattened JSON? Flat JSON format is much easier for Apache Spark to process, so for this tutorial, the flattened option is the best choice. If you decide to use the Simple Data Pipe to process reddit data with something other than Spark, you probably want to choose JSON to get the output in its purest form. 11. Click Skip , to bypass scheduling. 12. Click Run now . When the data’s done loading, you see a Pipe Run complete! message. 13. Click View details . ANALYZE REDDIT DATA CREATE NEW R NOTEBOOK 1. In Bluemix, open your Apache Spark service. Go to your dashboard and, under Services , click the Apache Spark tile and click Open . 2. Open an existing instance or create a new one. 3. Click New Notebook . 4. Click the From URL tab. 5. Enter any name, and under Notebook URL enter https://github.com/ibm-cds-labs/reddit-sentiment-analysis/raw/master/couchDB-R/Preview-R-couchDB.ipynb 6. Click Create Notebook 7. Copy and enter your Cloudant credentials.In a new browser tab or window, open your Bluemix dashboard and click your Cloudant service to open it. From the menu on the left, click Service Credentials . If prompted, click Add Credentials . Copy your Cloudant host , username , and password into the corresponding places in cell 4 of the notebook (replacing XXXX’s). RUN THE CODE AND GENERATE REPORTS 1. 
Install CouchDB R package Run cells 1 and 2 to install the CouchDB package and library. You need to run these only once. Read more about the package . 2. Define a variable sqlContext to use existing Spark ( sc ) and SparkRSQL Context that is already initialized with IBM Analytics for Apache Spark as Service. In [3]: sqlContext 3. Run cell 4 to connect to Cloudant. 4. Run cell 5 to get a list of Cloudant databases. In [5]: couch_list_databases(myconn) Out [5]: 'pipe_db' 'reddit_sparkr_top_comments_only' 5. Then read this connection by running the next cell:In [6]: print(myconn) CREATE A SPARKR DATAFRAME FROM A CLOUDANT DATABASE There is no magic function that gets desired documents into a ready-to-use SparkR dataframe. Instead, the function couch_fetch() retrieves a document object with value based on a key. At this point in the code, I don't have keys in hand. Thanks to the primary index _all_docs that comes with Cloudant databases, there's no need to write extra code. Simply add a forward slash / and _all_docs to the database name. (To learn more, read https://cloudant.com/for-developers/all_docs/ .) 1. Use _all_docs to fetch all documents from the Cloudant database (mine is named reddit_regularreddit_top_comments_and_replies ) and create a data frame by running the following command:In[7]: results Note: Insert your database name in this cell. You can find it in results from running couch_list_databases(myconn) 2 cells before: About SparkR and R Dataframes ""SparkR is based on Spark’s parallel DataFrame abstraction. Users can create SparkR DataFrames from “local” R data frames, or from any Spark data source such as Hive, HDFS, Parquet or JSON."" -Spark 1.4 Announcement You can create a SparkR dataframe from R data by calling function createDataFrame() or as.DataFrame() both will do the job. In this case, I used createDataFrame(sqlContext, data) where data is R dataframe or a list, and it returns a DataFrame. Alternatively you can download the content of a SparkDataFrame into an R's data.frame, by calling function as.data.frame() (all lowercase). So, as.data.frame(x) where x is DataFrame, returns a data.frame. Tip: Learn more about R data type by calling function typeof(x) where x is a R data type, either a matrix or list or vector or data.frame. For detailed API http://spark.apache.org/docs/latest/api/R/index.html 2. Print the schema that you just created by running the next cell:In [8]: printSchema(df) which returns: root |-- total_rows: integer (nullable = true) |-- offset: integer (nullable = true) |-- rows_id: string (nullable = true) |-- rows_key: string (nullable = true) |-- rows_rev: string (nullable = true) |-- rows_id_1: string (nullable = true) |-- rows_key_1: string (nullable = true) |-- rows_rev_1: string (nullable = true) |-- rows_id_2: string (nullable = true) |-- rows_key_2: string (nullable = true) |-- rows_rev_2: string (nullable = true) |-- rows_id_3: string (nullable = true) |-- rows_key_3: string (nullable = true) |-- rows_rev_3: string (nullable = true) The first row is the _design_ document so ignore. All that follows is reddit data. 3. Run the typeof(results) command, which returns 'list' 4. Print the list results returned by couch_fetch() . 
In [10]: print(results) $total_rows [1] 4 $offset [1] 0 $rows $rows[[1]] $rows[[1]]$id [1] ""_design/Top comments and replies"" $rows[[1]]$key [1] ""_design/Top comments and replies"" $rows[[1]]$value $rows[[1]]$value$rev [1] ""1-edc6f6bb0062260ecf1160c81872efdd"" $rows[[2]] $rows[[2]]$id [1] ""f4f7cfa487898608fff6eb639fe6ed26"" $rows[[2]]$key [1] ""f4f7cfa487898608fff6eb639fe6ed26"" $rows[[2]]$value $rows[[2]]$value$rev [1] ""1-c0be345c89577577cdeb301328d9e4f5"" ..... 5. Next, iterate over the list of keys returned, fetch individual documents, create a R dataframe, add each document as a row to the dataframe and create a new SparkR dataframe.In [11]: keys_list Output looks like root |-- X_id: string (nullable = true) |-- X_rev: string (nullable = true) |-- author: string (nullable = true) |-- created: integer (nullable = true) |-- edited: integer (nullable = true) |-- id: string (nullable = true) |-- title: string (nullable = true) |-- text: string (nullable = true) |-- Anger: string (nullable = true) |-- Disgust: string (nullable = true) |-- Fear: string (nullable = true) |-- Joy: string (nullable = true) |-- Sadness: string (nullable = true) |-- Analytical: string (nullable = true) |-- Confident: string (nullable = true) |-- Tentative: string (nullable = true) |-- Openness: string (nullable = true) |-- Conscientiousness: string (nullable = true) |-- Extraversion: string (nullable = true) |-- Agreeableness: string (nullable = true) |-- Emotional_Range: string (nullable = true) |-- pt_type: string (nullable = true) +--------------------+--------------------+----------+----------+----------+-------+-----+--------------------+-----+-------+-----+-----+-------+----------+---------+---------+--------+-----------------+------------+-------------+---------------+--------------------+ | X_id| X_rev| author| created| edited| id|title| text|Anger|Disgust| Fear| Joy|Sadness|Analytical|Confident|Tentative|Openness|Conscientiousness|Extraversion|Agreeableness|Emotional_Range| pt_type| +--------------------+--------------------+----------+----------+----------+-------+-----+--------------------+-----+-------+-----+-----+-------+----------+---------+---------+--------+-----------------+------------+-------------+---------------+--------------------+ |f4f7cfa487898608f...|1-c0be345c8957757...| delfinom|1467130823| 0|d4rcp8s| |our strategy ...|18.72| 46.65|19.80|14.65| 27.61| 92.20| 0.00| 64.70| 42.10| 59.10| 81.20| 69.30| 15.80|Top comments and ...| |f4f7cfa487898608f...|1-6b0c4d5588c127c...|BlackOdder|1467127251|1467129666|d4ra2ax| |This is good. Hop...|33.49| 54.14|29.26| 3.34| 29.87| 51.60| 0.00| 88.90| 1.20| 7.90| 96.60| 98.90| 94.20|Top comments and ...| |f4f7cfa487898608f...|1-aa11fd3a2efdfd7...|grauenwolf|1467127117|1467127784|d4r9yse| |I don't see how t...|98.15| 54.75|13.05| 2.03| 5.21| 70.70| 0.00| 96.40| 50.20| 3.80| 47.30| 39.00| 84.30|Top comments and ...| +--------------------+--------------------+----------+----------+----------+-------+-----+--------------------+-----+-------+-----+-----+-------+----------+---------+---------+--------+-----------------+------------+-------------+---------------+--------------------+ ``` ANALYZE REDDIT DATA WITH SPARKR SQL AND PLOT WITH R Now we'll create a bar chart showing comment count by sentiment (for comments scoring higher than 70%). 
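The notebook does this with SparkR in the next cell. If you ever want to sanity-check the same aggregation outside the R kernel, an equivalent in PySpark looks roughly like this. It is a sketch only: the DataFrame df and the tone column names come from the schema printed above, the 70% threshold matches the text, and the cast to double is needed because the schema stores the scores as strings.

# Count comments whose tone score exceeds 70, one count per tone column.
# Assumes a DataFrame `df` shaped like the schema shown above.
tone_columns = ['Anger', 'Disgust', 'Fear', 'Joy', 'Sadness',
                'Analytical', 'Confident', 'Tentative', 'Openness',
                'Conscientiousness', 'Extraversion', 'Agreeableness',
                'Emotional_Range']

counts = {col: df.filter(df[col].cast('double') > 70).count()
          for col in tone_columns}

for col, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(col, n)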
In [12]: registerTempTable(df2,""reddit"") sentimentDistribution 70') df3 70% in IBM Reddit AMA"",col=139, ylim=c(0,130),cex.axis=0.5,cex.names=0.5,ylab=""Reddit comment count"") FILTER REDDIT DATA WITH SPARKR AND PRINT REPORT TO NOTEBOOK You don't need to run SQL queries to work with Spark Dataframes. Dataframes have functions like filter, select, grouping, and aggregation. filter() returns rows and select() returns columns that meet the condition passed in the input. In the following code, filter() returns a Dataframe containing rows that have an emotion score higher than 70%. select() returns author (redditor - reddit user) and text (comments by redditors) from the Dataframe returned earlier. We could use SparkR Dataframe functions head() and showDF() to show a quick data overview. But since we want a full list of comments by sentiment with that high emotional score ( 70%), we call R print() function. In [13] for(i in 1:length(columns)){ columnset 70') ) if(count(columnset) 0){ print('----------------------------------------------------------------') print(columns[i]) print('----------------------------------------------------------------') comments Results show comments grouped by sentiment. Some comments appear under multiple sentiment categories. For example, the question Are you ashamed of Lotus Notes? appears both under Disgust and Extraversion . You can scroll through the list. TRY LOADING A DIFFERENT REDDIT CONVERSATION Launch your Simple Data Pipe app again and return to the Load reddit Data section. In step 7, swap in a different URL, run the notebook again, and check out the results. CONCLUSION If you're an R fan, you'll appreciate that SparkR provides a handy R frontend for Spark. What's great about moving data from JSON document to a SparkR or R dataframe, is that the data structure pretty much remains the same. That offers flexibility to write SparkR notebooks fast and makes it easy to move data in and out of SparkR and R dataframes. Both offer powerful operations that produce informative analytics with high performance. R and JSON: made for each other. SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Click to share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Geospatial * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM Privacy IBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.","More analysis options for Simple Data Pipe output. 
Once data is in Cloudant, connect via CouchDB then analyze the JSON with SparkR in a Jupyter notebook.",Reddit sentiment analysis in SparkR and CouchDB,Live,141 372,"Homepage Stats and Bots Follow Sign in Get started Homepage * Home * DATA SCIENCE * ANALYTICS * STARTUPS * BOTS * DESIGN * Subscribe * * 🤖 TRY STATSBOT FREE * Jay Shah Blocked Unblock Follow Following Machine Learning Enthusiast Nov 16 -------------------------------------------------------------------------------- NEURAL NETWORKS FOR BEGINNERS: POPULAR TYPES AND APPLICATIONS AN INTRODUCTION TO NEURAL NETWORKS LEARNING Today, neural networks are used for solving many business problems such as sales forecasting, customer research, data validation, and risk management. For example, at Statsbot we apply neural networks for time series predictions, anomaly detection in data, and natural language understanding. In this post, we’ll explain what neural networks are, the main challenges for beginners of working on them, popular types of neural networks, and their applications. We’ll also describe how you can apply neural networks in different industries and departments. THE IDEA OF HOW NEURAL NETWORKS WORK Recently there has been a great buzz around the words “neural network” in the field of computer science and it has attracted a great deal of attention from many people. But what is this all about, how do they work, and are these things really beneficial? Essentially, neural networks are composed of layers of computational units called neurons, with connections in different layers. These networks transform data until they can classify it as an output. Each neuron multiplies an initial value by some weight, sums results with other values coming into the same neuron, adjusts the resulting number by the neuron’s bias, and then normalizes the output with an activation function. ITERATIVE LEARNING PROCESS A key feature of neural networks is an iterative learning process in which records (rows) are presented to the network one at a time, and the weights associated with the input values are adjusted each time. After all cases are presented, the process is often repeated. During this learning phase, the network trains by adjusting the weights to predict the correct class label of input samples. Advantages of neural networks include their high tolerance to noisy data, as well as their ability to classify patterns on which they have not been trained. The most popular neural network algorithm is the backpropagation algorithm . Once a network has been structured for a particular application, that network is ready to be trained. To start this process, the initial weights (described in the next section) are chosen randomly. Then the training (learning) begins. The network processes the records in the “training set” one at a time, using the weights and functions in the hidden layers, then compares the resulting outputs against the desired outputs. Errors are then propagated back through the system, causing the system to adjust the weights for application to the next record. This process occurs repeatedly as the weights are tweaked. During the training of a network, the same set of data is processed many times as the connection weights are continually refined. SO WHAT’S SO HARD ABOUT THAT? One of the challenges for beginners in learning neural networks is understanding what exactly goes on at each layer. 
We know that after training, each layer extracts higher and higher-level features of the dataset (input), until the final layer essentially makes a decision on what the input features refer to. How can it be done? Instead of exactly prescribing which feature we want the network to amplify, we can let the network make that decision. Let’s say we simply feed the network an arbitrary image or photo and let the network analyze the picture. We then pick a layer and ask the network to enhance whatever it detected. Each layer of the network deals with features at a different level of abstraction, so the complexity of features we generate depends on which layer we choose to enhance. POPULAR TYPES OF NEURAL NETWORKS AND THEIR USAGE In this post on neural networks for beginners, we’ll look at autoencoders, convolutional neural networks, and recurrent neural networks. AUTOENCODERS This approach is based on the observation that random initialization is a bad idea and that pre-training each layer with an unsupervised learning algorithm can allow for better initial weights. Examples of such unsupervised algorithms are Deep Belief Networks. There are a few recent research attempts to revive this area, for example, using variational methods for probabilistic autoencoders. They are rarely used in practical applications. Recently, batch normalization started allowing for even deeper networks, we could train arbitrarily deep networks from scratch using residual learning. With appropriate dimensionality and sparsity constraints, autoencoders can learn data projections that are more interesting than PCA or other basic techniques. Let’s look at the two interesting practical applications of autoencoders: • In data denoising a denoising autoencoder constructed using convolutional layers is used for efficient denoising of medical images. A stochastic corruption process randomly sets some of the inputs to zero, forcing the denoising autoencoder to predict missing (corrupted) values for randomly selected subsets of missing patterns. • Dimensionality reduction for data visualization attempts dimensional reduction using methods such as Principle Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE). They were utilized in conjunction with neural network training to increase model prediction accuracy. Also, MLP neural network prediction accuracy depended greatly on neural network architecture, pre-processing of data, and the type of problem for which the network was developed. CONVOLUTIONAL NEURAL NETWORKS ConvNets derive their name from the “convolution” operator. The primary purpose of convolution in the case of a ConvNet is to extract features from the input image. Convolution preserves the spatial relationship between pixels by learning image features using small squares of input data. ConvNets have been successful in such fields as: * Identifying faces In the identifying faces work, they have used a CNN cascade for fast face detection. The detector evaluates the input image at low resolution to quickly reject non-face regions and carefully process the challenging regions at higher resolution for accurate detection. Illustration sourceCalibration nets were also introduced in the cascade to accelerate detection and improve bounding box quality. Illustration source * Self driving cars In the self driving cars project, depth estimation is an important consideration in autonomous driving as it ensures the safety of the passengers and of other vehicles. 
Such aspects of CNN usage have been applied in projects like NVIDIA’s autonomous car. CNN’s layers allow them to be extremely versatile because they can process inputs through multiple parameters. Subtypes of these networks also include deep belief networks (DBNs). Convolutional neural networks are traditionally used for image analysis and object recognition. Illustration sourceAnd for fun, a link to use CNNs to d rive a car in a game simulator and predict steering angle . RECURRENT NEURAL NETWORKS RNNs can be trained for sequence generation by processing real data sequences one step at a time and predicting what comes next. Here is the guide on how to implement such a model . Assuming the predictions are probabilistic, novel sequences can be generated from a trained network by iteratively sampling from the network’s output distribution, then feeding in the sample as input at the next step. In other words, by making the network treat its inventions as if they were real, much like a person dreaming. • Language-driven image generation Can we learn to generate handwriting for a given text? To meet this challenge a soft window is convolved with the text string and fed as an extra input to the prediction network. The parameters of the window are output by the network at the same time as it makes the predictions, so that it dynamically determines an alignment between the text and the pen locations. Put simply, it learns to decide which character to write next. • Predictions A neural network can be trained to produce outputs that are expected, given a particular input. If we have a network that fits well in modeling a known sequence of values, one can use it to predict future results. An obvious example is Stock Market Prediction. APPLYING NEURAL NETWORKS TO DIFFERENT INDUSTRIES Neural networks are broadly used for real world business problems such as sales forecasting, customer research, data validation, and risk management. MARKETING Target marketing involves market segmentation, where we divide the market into distinct groups of customers with different consumer behavior. Neural networks are well-equipped to carry this out by segmenting customers according to basic characteristics including demographics, economic status, location, purchase patterns, and attitude towards a product. Unsupervised neural networks can be used to automatically group and segment customers based on the similarity of their characteristics, while supervised neural networks can be trained to learn the boundaries between customer segments based on a group of customers. RETAIL & SALES Neural networks have the ability to simultaneously consider multiple variables such as market demand for a product, a customer’s income, population, and product price. Forecasting of sales in supermarkets can be of great advantage here. If there is a relationship between two products over time, say within 3–4 months of buying a printer the customer returns to buy a new cartridge, then retailers can use this information to contact the customer, decreasing the chance that the customer will purchase the product from a competitor. BANKING & FINANCE Neural networks have been applied successfully to problems like derivative securities pricing and hedging, futures price forecasting, exchange rate forecasting, and stock performance. Traditionally, statistical techniques have driven the software. These days, however, neural networks are the underlying technique driving the decision making. 
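Forecasting is also the easiest place to make the iterative learning process described earlier concrete. The sketch below is not from the original article: it trains a tiny one-hidden-layer network in plain NumPy on a made-up series, following exactly the loop outlined above - random initial weights, a forward pass, an error propagated back through the layers, and a small adjustment to every weight on each pass. The layer sizes, learning rate, and data are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Toy 'time series': predict the next value from the previous three.
series = np.sin(np.linspace(0, 8 * np.pi, 200))
window = 3
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:].reshape(-1, 1)

# Random initial weights for a 3 -> 8 -> 1 network.
W1 = rng.normal(0, 0.5, (window, 8))
b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, (8, 1))
b2 = np.zeros(1)
lr = 0.05

for epoch in range(2000):
    # Forward pass: weighted sums, bias, activation.
    h = np.tanh(X @ W1 + b1)
    pred = h @ W2 + b2

    # Error between the network's output and the desired output.
    err = pred - y

    # Backpropagation: push the error back and nudge every weight.
    grad_pred = 2 * err / len(X)
    grad_W2 = h.T @ grad_pred
    grad_b2 = grad_pred.sum(axis=0)
    grad_h = grad_pred @ W2.T * (1 - h ** 2)
    grad_W1 = X.T @ grad_h
    grad_b1 = grad_h.sum(axis=0)

    W1 -= lr * grad_W1
    b1 -= lr * grad_b1
    W2 -= lr * grad_W2
    b2 -= lr * grad_b2

print('final training MSE:', float((err ** 2).mean()))

After a couple of thousand passes the training error is small, which is all 'training' means here; a real forecasting model would add held-out validation, feature scaling, and of course real data.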
MEDICINE Neural networks are a trending research area in medicine, and it is believed that they will receive extensive application to biomedical systems in the next few years. At the moment, the research is mostly on modelling parts of the human body and recognising diseases from various scans. CONCLUSION Perhaps NNs can, though, give us some insight into the “easy problems” of consciousness: how does the brain process environmental stimulation? How does it integrate information? But the real question is: why and how is all of this processing, in humans, accompanied by an experienced inner life, and can a machine achieve such self-awareness? It makes us wonder whether neural networks could become a tool for artists — a new way to remix visual concepts — or perhaps even shed a little light on the roots of the creative process in general. All in all, neural networks have made computer systems more useful by making them more human. So next time you think you might like your brain to be as reliable as a computer, think again — and be grateful you have such a superb neural network already installed in your head! I hope that this introduction to neural networks for beginners will help you build your first project with NNs. RECOMMENDED SOURCES FOR BEGINNERS DEEP NEURAL NETWORKS * What is the difference between deep learning and usual machine learning? * What is the difference between a neural network and a deep neural network? * How is deep learning different from multilayer perceptron? NEURAL NETWORKS PROJECTS * A classic example of Mapping Input to Output Image * Trying out Face Recognition on your own * Convolutional Neural Networks for Visual Recognition * Online Stanford Course on CNNs * Machine Learning * Neural Networks * Artificial Neural Network * Data Science * Recurrent Neural Network JAY SHAH Machine Learning Enthusiast STATS AND BOTS Data stories on machine learning and analytics. From Statsbot's makers.","An introduction to neural networks for beginners: the main challenges of working on neural networks, their popular types and applications.",Neural networks for beginners: popular types and applications,Live,142 374,"APACHE SPARK 0 TO LIFE-CHANGING APP: SCALA FIRST STEPS AND AN INTERVIEW WITH JAKOB ODERSKY SCALA! THE LANGUAGE THAT EVOKES EXTREME DIFFERENCES IN OPINION. Being new to Silicon Valley, I have only recently come across the very strong opinions of developers. Whether it be spaces versus tabs or Scala versus Python, people definitely feel strongly one way or the other.
So whether you love Scala for its brevity and concise nature or whether you hate it for being different, the fact is, Scala is very important for Spark, and after all, this is the Spark Technology Center. That is why this week I am giving you some context around Scala and a means to get you started, you know, before we move forward with our life-changing app and generally saving the world. Not sure what I'm talking about or what I'm doing? Look here and here and here . For all of those people out there who are new to Spark or Scala, what you might not know is that although Spark has a shell available in Scala and Python and supports Scala, Java, Python, Clojure, and R, Scala has an advantage . Spark is written in the Scala Programming Language and runs on the Java Virtual Machine (JVM). This means that Scala has more capabilities on Spark than the PySpark alternative. (Depending on who you ask, this difference is varying--again, lot's of opinions!) Not only this, but Scala inherently allows you to have more succinct code, which is great for working with big data. TO UNDERSTAND SCALA EVEN BETTER, I SAT DOWN WITH JAKOB ODERSKY, A REAL-LIFE, BONAFIDE SCALA EXPERT, TO ASK HIM A FEW SERIOUSLY SCALA QUESTIONS. WHY IS SCALA IMPORTANT FOR SPARK? Spark's core APIs are implemented in Scala; it is the lingua franca of the engine. I would also suggest that Scala's features, specifically its conciseness combined with typesafety, make it ideal for implementing any kind of collection framework, which, if you think about it, Spark really is at its highest level of abstraction. DOES IT PERFORM DIFFERENTLY THAN PYTHON? Python is generally an interpreted language and therefore runs slower than Scala, which is compiled to java bytecode and can run on the heavily optimized Java Virtual Machine. In the original Spark APIs, where you write a sequence of operations on RDDs, a difference in performance is quite noticable. However, The newer Spark APIs (Datasets, Dataframes, etc) are opaque, in that they hide operation details and let you specify ""what"" you want rather than ""how"" you want it. This enables them to apply further optimization and expose a uniform entry-point to all languages, thus making performance differences negligible (if you require only the functionality provided by the newer APIs). WHAT DO YOU LIKE MOST ABOUT SCALA? There a couple of things I like about the language. Its type system is incredibly complete, yet it doesn't get in your way of writing elegant and concise code. I would say that my favorite feature is its simplicity compared to expressivity: the language itself offers few, yet extremely powerful constructs, allowing you to build libraries that feel ""native"" or ""built-in"", yet are just implemented with regular features offered by Scala to anyone. WHY IS IT RELEVANT TO BIG DATA AND SYSTEMML? Making so called ""big data"" accessible from easy-to-use abstractions is essential for fast and productive analysis. Scala makes it very simple to write domain specific languages that can leverage analytics engines such as SystemML but offer a low-barrier entry point to anyone. Furthermore, it is also possible to use Scala in an interpreter, making it a natural choice to integrate into data science notebooks [like Jupyter and Zeppelin]. This in turn makes it possible to rapidly explore data, and with all the benefits of the language's safety and expressivity, also make it a fun experience! DO YOU HAVE ANY RESOURCES YOU WOULD RECOMMEND FOR NEW DEVELOPERS AND DATA SCIENTISTS? 
My recommendation would be to check out the first weeks of some online courses, just to get a basic understanding of the language. As a beginner you are extremely susceptible to either liking or hating a topic, depending on the way you learn it, so a good source is essential. There is no need to follow the whole program, however; just a few hours should give you a solid foundation to continue on your own. If you already have some knowledge of Java, I would also recommend reading Cay Horstmann's book ""Scala for the Impatient"". NOW THAT YOU HAVE THE CONTEXT, BELOW IS A BASIC TUTORIAL ON HOW TO GET GOING WITH SCALA. Quick Note: going beyond this cheat sheet is essential. I definitely recommend reading the book 'Atomic Scala' by Bruce Eckel and Dianne Marsh to understand the basics of Scala syntax once you have your shell or REPL up and running. ASSUMING YOU FOLLOWED MY FIRST BLOG, YOU SHOULD HAVE ALREADY DOWNLOADED SPARK AND SET SPARK HOME IN YOUR BASH PROFILE. IF YOU HAVEN'T, THEN DO THIS BEFORE YOU TRY TO ENTER THE SPARK SHELL IN THE STEP BELOW. MAKE SURE TO ALSO SET YOUR PATH! MY SCALA AND SPARK ARE TOGETHER IN THE FOLLOWING EXAMPLE. FIRST, MAKE SURE JAVA IS INSTALLED. //In your terminal type: java -version //Update if needed //Or install if needed brew tap caskroom/cask brew install Caskroom/cask/java UPDATE OR INSTALL SCALA. //check what version of scala you have installed scala -version //If you want to switch versions type this: brew switch scala 2.9.2 brew switch scala 2.10.0 //If you need to install scala brew install scala SET SCALA HOME AND PUT SCALA IN YOUR PATH. //Pay attention to where you saved Scala! //Go to your bash profile. vi ~/.bash_profile //Type i for insert. i //Now set Scala Home and put it in your path. export SCALA_HOME=/Users/stc/scala /*Notice my Scala Home and Spark Home are on the same line of code for my path.*/ export PATH=$SCALA_HOME/bin:$SPARK_HOME/bin:$PATH //Now write and quit the changes :wq LOAD THE CHANGES YOU MADE IN YOUR BASH PROFILE. source ~/.bash_profile NOW YOU CAN LOAD THE REPL (READ-EVALUATE-PRINT-LOOP) OR THE SPARK SHELL TO WORK IN SCALA. //To load the REPL just type while in your terminal: scala /*If you saved Scala Home and put it in your path it should work */ //For the spark-shell, type: spark-shell //The scala> prompt should now be showing. //If it's not, double check your .bash_profile YOU'RE READY TO START EXPERIMENTING! //Try setting some variables and running simple math. scala> val a = 15 scala> val b = 15.15 scala> a * b //should return: res0: Double = 227.25 //Double means a fractional number. //An Int means a whole number. //Knowing this, you could rewrite the above code as: scala> val a:Int = 15 scala> val b:Double = 15.15 /*Just remember that val is immutable and var is mutable. Immutable means that if you change the value, you create a new value. Mutable means you can change the value at the source. Be careful using mutable values if you're working with others. This can make it very difficult for everyone to be on the same page at the same time.*/ //You can also print your first line. scala> println(""What up Scala coder?"") //If you're ready to exit, type: :quit Now you are ready to use Scala in the Spark shell! Before we move forward with our life-changing app, I'd recommend viewing some tutorials or reading one of the recommended books. Knowledge of Scala will be super helpful as we move forward with saving the world! Stay tuned for our next step! By Madison J.
Myers DATE 25 July 2016 TAGS apache spark, systemml, Life-changing","To understand Scala even better, I sat down with Jakob Odersky, a real-life, bonafide Scala expert, to ask him a few seriously Scala questions.",0 to Life-Changing App: Scala First Steps and an Interview with Jakob Odersky,Live,143 375,"THIS WEEK IN DATA SCIENCE (JULY 12, 2016) Posted on July 12, 2016 by Coralie Phanord Here's this week's news in Data Science and Big Data. Don't forget to subscribe if you find this useful! INTERESTING DATA SCIENCE ARTICLES AND NEWS * Musicmap – Learn about the Genealogy and history of popular music genres through an interactive data visualization. * IBM is making a music app that can create entirely new songs just for you – IBM Watson will soon be able to work as a creative assistant with humans to create entirely new music on an app. * Data Mining Reveals the Six Basic Emotional Arcs of Storytelling – Researchers at the Computational Story Lab at the University of Vermont used sentiment analysis to map the emotional arcs of over 1,700 stories and then used data-mining techniques to reveal the most common arcs. * Predictive Analytics, Big Data, and How to Make Them Work for You – Learn how data mining, regression analysis, machine learning, and data visualization tools can help change the way you do business. * Internet Of Things On Pace To Replace Mobile Phones As Most Connected Device In 2018 – Internet of Things (IoT) devices are expected to grow at a 23% compound annual growth rate from 2015 to 2021. They are expected to exceed mobile phones as the largest category of connected devices. * Is it brunch time? – Ben Jacobson uses data analysis and visualization to study the best and most popular time for brunch. * Google's robot cars recognize cyclists' hand signals – better than most cyclists – Google's self-driving car is friendly to cyclists. It will err on the cautious side and surrender the lane to cyclists. * Weather Visualization is Powered by Big Data – High-performance computing (HPC) developers are using Big Data to eliminate guesswork involved in accurate weather forecasts. * Introducing OpenCellular: An open source wireless access platform – Facebook has designed a cost-effective open source wireless access platform aimed to improve connectivity in remote areas of the world. * Improving City Living with Smart Lighting Data – Hackathon platform Devpost and GE are launching a Hackathon that challenges civic hackers to develop smart city applications using the data from Internet-connected lighting systems. * Big data jobs are in high demand – As Big Data is becoming a part of everyday life, organizations in all fields can use big data to improve. * Google's DeepMind AI to use 1 million NHS eye scans to spot diseases earlier – Google partnered with NHS's Moorfields Eye Hospital to apply machine learning in order to spot eye diseases earlier.
* Hadoop vs Spark: Which is right for your business? Pros and cons, vendors, customers and use cases – What are the pros and cons of each open source big data framework, and which is best for your enterprise. * Privacy Shield – Houston, We Still Have a Problem! – The European Commission (EC) has been working on an agreement with the U.S., called the Privacy Shield. How is it different from Safe Harbour, the previous agreement? * Mapping the Computer Science Skills Gap – The App Association has created an interactive map showing the areas of the United States with the highest demand for people with computer science skills. * Why Python is Slow: Looking Under the Hood – Take a look at Python's standard library and dive into the details to understand why Python is so slow. UPCOMING DATA SCIENCE EVENTS * Data Science Bootcamp – A summer of data, analytics and insight – Join Ryerson University and Big Data University's bootcamp this summer in Toronto. * The Big Data Channel – Join leaders in Big Data at the IoT, Big Data, and Visualization summits on September 8 & 9 in Boston. * Data for Development: Powering Evidence-Based International Aid with Mobile Technology – Join the Center for Data Innovation for a panel discussion on how policymakers and international development organizations can take advantage of data to improve effectiveness, on August 3rd in Washington D.C. Tags: analytics, Big Data, data science, events, weekly roundup","Our twenty second release of a weekly round up of interesting Data Science and Big Data news, links, and upcoming events.","This Week in Data Science (July 12, 2016)",Live,144 377,"Susanna Tai, Offering Manager, Watson Data Platform | Data Catalog Aug 15 -------------------------------------------------------------------------------- DON'T THROW MORE DATA AT THE PROBLEM! HERE'S HOW TO UNLOCK TRUE VALUE FROM YOUR DATA LAKE August 15, 2017 | Written by: Jay Limburn Just recently in the UK, we've seen the dangers of making decisions based on incomplete or poor data play out on the world stage. The Prime Minister called a general election three years earlier than she needed to, basing her decision on data that showed that it would allow her to win a bigger majority in parliament. Evidently, the data her team used was lacking: her party lost its overall majority, and the UK ended up with a hung parliament.
So, what had the Prime Minister’s team missed? The election saw a higher turnout of voters under age 35 than previous elections[1] — a demographic that her policies had failed to win over. The result was a bad decision based on incomplete data. We may not all have the fates of nations in our hands, but the lesson is one from which we can all learn. Companies grapple with a version of this same challenge every day, when they try to make important strategic decisions based on data that may be incomplete, inconsistent, inaccurate, or out-of-date. To lessen the likelihood of bad decisions, many companies have invested in extending their data lakes: the idea is that the more data you have, the less likely you are to miss something important. But throwing more data at the problem isn’t always enough to protect you from poor choices. Having too much information can prevent you from seeing the forest for the trees — particularly if that information is poorly organized or difficult to find. DISILLUSIONED WITH BIG DATA? YOU’RE NOT THE ONLY ONE It’s a familiar story: companies respond to the hype around big data by building huge data lakes, but then find they don’t deliver the expected value. The data is there, but knowledge workers can’t easily access it, and therefore can’t work effectively. Moreover, the company is now paying for new systems to house all this data, and needs to find highly skilled data scientists and engineers to maintain them. What’s gone wrong? One common issue is cultural: despite having the technical infrastructure in place, different departments are often reluctant to share their data. We discussed this challenge in my recent blog post, “ Data governance — You could be looking at it all wrong ” , but essentially, data owners need to have confidence that the data they share will be accessed, used and protected appropriately. A lack of effective data governance within data lakes prevents users from trusting the system, so they hoard their data instead. As a result, its value is lost to the rest of the company. Even if users are persuaded to share their data, it can be difficult to decide (a) how to share it, and, (b) what kind of data cleansing needs to happen before it is safe for others to use. Answering these questions may require yet another large IT investment. The other major challenge is findability of data. This issue is often exacerbated when companies treat their data lake as a dumping ground for assets, rather than a well-organized and actively managed archive. In these circumstances, it is difficult for users to find or understand assets within the data lake, and when they do, they are of questionable quality and unknown provenance. Again, this discourages data sharing and reuse. The problem is widespread: it was recently reported that data scientists, business analysts and other knowledge workers estimate that they spend 80 percent of their time searching for, cleaning and organizing data, and only 20 percent actually analyzing it.[2] But what if there was a way to resolve the challenges around both data governance and findability of data in a single move? ENTER IBM DATA CATALOG Built on Watson Data Platform , IBM Data Catalog is IBM’s next-generation, cloud-based enterprise data catalog. It promises to provide a central solution where users can catalog, govern and discover information assets, and it is designed to slash the time spent searching for and hesitating over sharing data, so that you can focus on extracting business value from your data assets. 
With Data Catalog, you will be able to index the assets already in your data lake, and then extend your strategy to include data from other sources too. For example, you can take advantage of the built-in governance and control functions to safely ingest enterprise assets that you were previously unable to move to the data lake due to complexity or ownership issues. Data hosted by shadow IT teams or SaaS providers, open datasets, data from social media or sensor feeds, local spreadsheets and other dark data, and so on — Data Catalog will help you liberate the value from all of these sources. Beyond the advantages of uniting all your assets in a single, governed catalog, Data Catalog will also offer: Self-service capabilities: With its intelligent catalog capabilities, Data Catalog will provide users with true self-service access to all the assets they are authorized to see. Its advanced search features will also help users zone in on the data that is most relevant to them, contributing to productivity. Driving culture change: With Data Catalog, every user becomes a data custodian. By making the process of cataloging simple, and automating the enforcement of governance policies, it will encourage users to share data. They can also curate and comment on assets, which makes the data easier for other users to find in the future. These factors drive a culture change towards data-centricity, creating a virtuous circle that continuously improves data governance over time. Uncovering insights: By providing a space where users can bring different datasets together and work with them in new ways, Data Catalog will help knowledge workers get deeper, more accurate and more nuanced answers to their questions, sooner. Integration with other solutions: Data Catalog will integrate with IBM Data Connect through the fabric of Watson Data Platform, making it easy for users to access physical data and move it into shared sandboxes or other workspaces for further manipulation or analysis. It is also integrated with IBM Data Science Experience , giving users access to a set of powerful data science tools they can use to explore new datasets and enhance their analysis. THE LURE OF THE CLOUD A few years ago, it was common to hear people say they would never move data outside their company’s firewalls. However, times are changing. Recent high-profile cyber attacks have demonstrated that keeping data on-premises may be no safer that storing it in the cloud. In fact, there’s even an argument that specialized cloud service providers may be able to take advantage of economies of scale to invest in better security capabilities than most traditional companies can afford in-house. As a result, many organizations are now considering moving at least some of their data into the cloud. For these organizations, creating a metadata index of your data with Data Catalog will be an ideal starting point. You won’t actually have to move your data to the cloud — only your metadata. In the process, you can get comfortable with cloud solutions, and start to foster support within your organization. As you gain confidence, Data Catalog will also help you assess which of your data assets naturally gravitate towards cloud platforms, and how best to prioritize the next steps in your cloud strategy. If we’ve piqued your interest, learn more about Data Catalog today. 
[1] Source: How Britain voted at the 2017 general election (YouGov) [2] Source: 2016 Data Science Report (CrowdFlower) -------------------------------------------------------------------------------- Originally published at www.ibm.com on August 15, 2017. * Data Governance * Data Management * Big Data * Data Lake * Data Catalog","Just recently in the UK, we've seen the dangers of making decisions based on incomplete or poor data play out on the world stage. The Prime Minister called a general election three years earlier than…",Don't throw more data at the problem! Here's how to unlock true value from your data lake,Live,145 378,"Greg Filla, Product manager & Data scientist — Data Science Experience and Watson Machine Learning Apr 14 -------------------------------------------------------------------------------- HOW TO USE DB2 WAREHOUSE ON CLOUD IN DATA SCIENCE EXPERIENCE NOTEBOOKS We have heard from many of you that Db2 Warehouse on Cloud is the relational database of choice for use in DSX. Today, I'm happy to announce a new feature that makes it even easier to use your Db2 Warehouse on Cloud data in DSX notebooks. We have added the same “Insert to code” functionality for Db2 Warehouse on Cloud that we have available for CSV and JSON files. You can insert a Db2 Warehouse on Cloud table into your code by creating a Connection in DSX. Connections allow you to manage database connections that can be added to different projects in DSX. This helps to encapsulate the data access for only members of a project. Let's see this feature in action: SETTING UP A PROJECT TO USE THIS FEATURE 1. Now, create a connection for this service in DSX. You can use this documentation to help with this step. 2. Once the connection is created, add it to a project by going to `Connections` in the 1001 tab of a project, checking the box for your connection and clicking `Apply`. 3. With the connection in the project, it's ready for use in a notebook. The connection can be used for existing notebooks or new ones. USING THIS FEATURE INSIDE A NOTEBOOK 1. This feature is very similar to other insert to code functionality in DSX. Check out this post showing how it is used for files in object storage. 2. Use this documentation to see how insert to code for Db2 Warehouse on Cloud works. 3. See the sections below to see what file formats can be selected for Db2 Warehouse on Cloud tables.
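To give a feel for the first Python option listed below, here is a rough, hand-written sketch of the kind of ibmdbpy code the “Insert to code” feature produces. The DSN, credentials, and table name are placeholders rather than values from this article, and the exact generated code will differ:

# Hedged sketch only: every connection detail below is a placeholder.
from ibmdbpy import IdaDataBase, IdaDataFrame

dsn = 'DASHDB;Database=BLUDB;Hostname=<your-host>;Port=50000;PROTOCOL=TCPIP;UID=<user>;PWD=<password>'
idadb = IdaDataBase(dsn=dsn)                      # open the Db2 Warehouse on Cloud connection
idadf = IdaDataFrame(idadb, '<SCHEMA>.<TABLE>')   # lazy handle; work is pushed to the database

print(idadf.head(5))   # only the first few rows are pulled back into pandas
idadb.close()

The appeal of the IdaDataFrame option over a plain pandas DataFrame is that filtering and aggregation run inside Db2 Warehouse on Cloud, so large tables never have to fit into the notebook's memory.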
Python Notebook * ibmdbpy IdaDataFrame (awesome library from IBM — pushes operations to Db2 Warehouse on Cloud rather than pulling into memory/cluster) * pandas DataFrame * SQL Context (Spark 1.6)/ SparkSession (Spark 2.0) * Insert Credentials R Notebook * ibmdbr ida.data.frame (awesome package — similar to ibmdbpy) * R DataFrame * SQL Context (Spark 1.6)/ SparkSession (Spark 2.0) * Insert Credentials Scala Notebook * SQL Context (Spark 1.6)/ SparkSession (Spark 2.0) * Insert Credentials Watch this video to see how to set up a connection to Db2 Warehouse on Cloud and a simple example of loading and analyzing data in a Scala notebook. We hope this feature makes your future work with DSX and Db2 Warehouse on Cloud a breeze. You can add any feedback or product suggestions to the DSX ideas page. -------------------------------------------------------------------------------- Originally published at datascience.ibm.com on April 14, 2017. * Data Science * Dsx * Db2","We have heard from many of you that Db2 Warehouse on Cloud is the relational database of choice for use in DSX. Today, I'm happy to announce a new feature that makes it even easier to use your Db2…",How to use Db2 Warehouse on Cloud in Data Science Experience notebooks,Live,146 382,"OFFLINE-FIRST QR-CODE BADGE SCANNER Glynn Bird / May 5, 2016 Offline-first web applications are websites with a twist; they instruct the browser to cache all of the assets they need to render themselves, such as images, css, and JavaScript files. Once loaded, the websites continue to function even when there is a flaky or non-existent network connection. The killer feature of such apps is that they can use in-browser storage to read and write dynamic data without relying on the presence of a cloud server. PouchDB lets the web application store data in the browser using a variety of local storage mechanisms, while presenting a simple API. Furthermore, when it does find a network connection, a PouchDB database can sync with a remote Apache® CouchDB™ or Cloudant database, and changes flow seamlessly in both directions without loss of data. Last year I made a simple offline-first data collection app that lets you design an HTML form and then use it to capture structured data, which is stored in PouchDB. My developer advocate colleagues used the app to collect submissions for a competition at a tech conference where the wifi was so poor that offline-first was the only option. When they returned home, they synced the PouchDB in their iPads to a shared Cloudant database. I thought I'd revisit this app to allow it to scan conference badges that contain a QR code. In fact, I ended up writing a whole new app.
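Before digging into payload details in the next section, a quick aside: if you want a scannable test badge to point the scanner at, the snippet below generates one. It uses the third-party Python qrcode package with Pillow (pip install qrcode[pil]), which is not part of this app's JavaScript stack and is shown only as a convenience; the contact details are made up.

# Optional helper, separate from the badge-scanner app itself: generate a test
# QR code containing a small vCard-style payload of fake contact details.
import qrcode

payload = '\n'.join([
    'BEGIN:VCARD',
    'VERSION:3.0',
    'FN:Test Attendee',
    'ORG:Example Org',
    'EMAIL;WORK;INTERNET:test@example.com',
    'END:VCARD',
])
img = qrcode.make(payload)    # returns a Pillow image of the QR code
img.save('test-badge.png')    # print it or display it, then scan it with the app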
Themore data the QR code has to store, the more detailed the blocks on the QR codeimage have to be:url vCardThe URL example above contains only http://www.glynnbird.com . The vCard example contains:BEGIN:VCARDVERSION:3.0N:Bird;GlynnFN:Glynn BirdORG:IBMTITLE:Developer AdvocateADR;work:;;1 The Square;Bristol;;BS1 6DG;UKTEL;WORK;VOICE:01179295012EMAIL;WORK;INTERNET:glynn.bird@uk.ibm.comURL:www.glynnbird.comEND:VCARDLEVERAGING OPEN-SOURCEAt first, I thought I would have to create a native iPhone or Android app tocapture images from a camera, decode QR codes, and store data in a database.Fortunately other open-source heroes have solved the hard problems for me: * W3C MediaStream API – to capture the host’s video camera feed * JavaScript QR code parsing library – to find QR codes in images * vCard parsing snippet – to parse vCard text * PouchDB – to store data in a local, in-browser database * Picnic CSS – to make the front-end presentable * AppCache API – to cache page assetsStill, it’s not quite that simple. The MediaStream API is new and notuniversally supported, so my code has to fall back on the older but also notuniversally supported getUserMedia API . I was also unable to get the media streaming code to work properly on mobiledevices. Conversely, the AppCache API is deprecated but its replacement, Service Workers , is not widely supported. Developers have to make such compromises every day;weighing established but deprecated functions against the latest bleeding edgecode that doesn’t have wide browser support.The finished demo app uses all of the above technologies in a single-page webapp that can be deployed to IBM Bluemix . Once you visit the page, it should be cached by your browser – try turningoff your wifi and revisiting the page:HOW DOES IT WORK?The web page contains a video tag, which the JavaScript uses to render areal-time feed of your machine’s webcam. The first time you open the website,you should be asked for permission for the app to access your webcam’s feed.There is also an invisible “canvas” control in the HTML markup, which takes asnapshot of the image every 0.5s. The data in the canvas goes to the QR-codeparsing library, which returns some data if it finds a QR code on the canvasimage.The QR-code is parsed and turned into a JSON object:{ ""version"": ""3.0"", ""fn"": ""Glynn Bird"", ""org"": ""IBM"", ""title"": ""Developer Advocate"", ""adr"": "",,1 The Square,Bristol,,BS1 6DG,UK"", ""tel"": ""01179295012"", ""email"": ""glynn.bird@uk.ibm.com"", ""url"": ""www.glynnbird.com"", ""ts"": 1461074275541, ""date"": ""2016-04-19T13:57:55.541Z""}which is saved to a PouchDB database using the db.post function call.Below the real-time video feed, is a table of previously saved cards presentedin “newest-first” order. This is achieved by querying the PouchDB database using a Map/Reduce index ordered on the ts (timestamp) value we created inside the JSON object.SYNCING TO CLOUDANTIn PouchDB, syncing to a remote CouchDB or Cloudant database is a simple ascalling the replicateTo function: db.replicate.to(remoteDB) .on(""change"", function(info) { // something changed }) .on(""complete"", function(info) { // all done }) .on(""error"", function(err) { // something went wrong });The variable remoteDB contains a URL of the remote database in the form: https://username:password@myhostname.cloudant.com/mydatabaseCONCLUSIONSCreating offline-first applications is in some ways easier than creatingtraditional client-server applications. 
Your database is always available because it resides on the same device as the browser, making for fast performance and 100% uptime. The hard part—getting data from the client to the server and vice versa—is handled for you by PouchDB/CouchDB/Cloudant replication, which requires only a single function call to initiate the process. Allowing webpages to render and function without a network lets web apps go places they couldn't normally go to: * capturing health data in developing countries * recording IoT data from remote sites * collecting information when the network is down or unusably slow Combining PouchDB with Cloudant makes it easy to create such applications, without getting into native application development. LINKS * Source code – https://github.com/ibm-cds-labs/badgescanner * Demo – https://badgescanner.mybluemix.net/ Tagged: cloudant / Offline First / PouchDB","Build a data collection app that captures and stores QR Code data, even when your network is unavailable.",Offline-first QR-code Badge Scanner,Live,147 383,"SEARCH SLACK WITH IBM GRAPH ptitzler / August 22, 2016 If you use Slack, you know it can be hard to find information in the torrent of messages that flow through your account. Our developer advocacy team is part of an enormous Slack account. When I have a question, it's hard to identify appropriate channels or members to ask. Fun fact: We currently have about 3,000 public channels, discussing a variety of topics, including cats! THE PROBLEM Let's say I went to our team's Slack account looking for information about the Cloudant schema discovery process, which is used to build and populate a dashDB data warehouse from IBM's Cloudant NoSQL database. I could: * Find channels that contain one or more relevant keywords in the name or purpose. Browsing and joining channels that include the key words I seek, like Cloudant (20+ hits), schema (2), and discovery (3), is time-consuming and may not help. * Use the built-in search features (global search, in-channel search, by-user search) to find messages that include Cloudant schema discovery. Results usually vary between no hits and a gazillion, depending on the quality of your exact search term(s) and the number of Slack messages in the system.
This never seems to work well for me.Ask people where or whom to ask. Not very efficient. THE SOLUTION USE A GRAPH DATABASE TO EXPLORE RELATIONSHIPS With hundreds of people exchanging thousands of messages daily, chances are good that the information (or contacts) you need can be automatically derived from the messages that were exchanged between users. A graph database is the perfect place to load and analyze this data. A graph is comprised of vertices (nodes) and edges (relationships). In our scenario, Slack users, channels, and keywords are vertices . Relationship between vertices, like user-to-channel, user-to-user, and user-to-keyword are Edges . I built a graph database prototype solution that analyzes these relationships to find answers to common questions. The solution uses a custom slash command as the “public” interface in Slack, a service to process the request and IBM Graph as the back-end database. HOW IT WORKS If you want to find info in Slack using my solution, you first enter the custom slash command /about followed by the search term. So to find info on Cloudant , you’d enter: /about cloudant . The service queries the graph database and returns the results to Slack for display. Immediately you see the people and channels containing that term. Retrieve information about channels or users by entering /about #nosql and /about @claudia , respectively. BUILDING A SLACK TEAM GRAPH To create a graph for a team representing users, channels, and keywords we: 1. Generate social and keyword statistics from the Slack messages. Batch scripts collect the data, operating on exported team message archives. We use Watson’s AlchemyAPI to extract keywords and user and channel references (like @betty and #cloudant-sdp ) to collect social stats. We’ve really just scratched the surface … Additional information could be used to improve result quality. For example, channels frequented primarily by bots (like #cloudant-devops ) might be ranked lower than channels with heavy user activity ( #cloudant-help ). 2. Build a graph model based on these statistics. The model is a logical representation of the Slack team graph, representing users, channels, keywords, and their relationships. The sample messages shown in the beginning of the blog post, might be represented in the model as follows: Once all relevant information has been added to the graph model, we can load it into IBM Graph. A graph model can be translated on the fly to Gremlin or input via bulk input APIs , so we can create many vertices and edges in the database with a relatively small number of requests. 3. Load the graph model into IBM Graph. We translate the graph model to Gremlin scripts and run those to create the vertices and edges. Once all objects are created we can use the IBM Graph web console in Bluemix to explore the Slack team graph by running traversal steps.For example, to inspect the Slack team graph, open the Query tab and enter Gremlin queries, like: def g=graph.traversal(); g.V().has(""isUser"", true).count(); def g=graph.traversal(); g.V().has(""isChannel"", true).count(); def g=graph.traversal(); g.V().has(""iskeyword"", true).count(); to count users, channels, or keywords: Here’s the big picture of how we create the graph: HOW SLACK USERS ACCESS THE GRAPH To provide users easy access to the graph (within Slack) we’ve created a simple service called about , implemented in NodeJS. 
This service extracts the query details (channel name, user name, or keyword) from the Slack request, connects to IBM Graph and runs predefined graph traversals using the IBM Graph client library (hat tip to Mike Elsmore). The results are visible only to the user that invoked the slash command. Sound interesting? Ready to explore your Slack Graph? Start here. Tagged: AlchemyAPI / Bluemix / graph / slack","Search your Slack account using an IBM Graph database and Watson's AlchemyAPI.",Search Slack with IBM Graph,Live,148 385,"See how easy it is to unlock your data for use in mobile and web applications, or for more flexible analysis and reporting. Bluemix Secure Gateway service lets you move data from on-premises to the cloud in a secure manner. This is a multi-part tutorial which shows how to set up a gateway and then build an app on top of it. Here, in Part 1, we'll cover: Lots of enterprises have valuable data they need to protect. To keep sensitive data secure, databases are often stored on-premises within an organization's physical location, where staff can protect it more easily. But more and more, organizations also want to host data in the cloud for easy availability and integration with analytics and mobile or web apps. They're looking to take data out of their system of record and open it to one or more systems of engagement. Secure Gateway lets you safely connect to an on-premises database. It works by creating a secure tunnel through which you can access protected data. The gateway encrypts and authenticates user connections, to prohibit unauthorized access. It's a way to open your on-premises data to the cloud and enjoy the flexibility, security, and scalability that it offers. One gateway can connect to many on-premises data sources. In this tutorial, we're using Bluemix, IBM's cloud platform, to create the gateway. Here's a simplified version of what we're doing here in Part 1: We'll create a new Secure Gateway on Bluemix, which generates a gateway ID. We'll use that ID to start the gateway client in our on-premises network. Optional, but smart: You can add additional security by enforcing the use of a security token when starting the client. Create one or more destinations (data sources) to your on-prem database servers.
Each destination will have its own port on the Bluemix server.Test the connection by accessing your data from your a browser or a bluemix app, through the url given for each destination.Here's how the different pieces connect together.In this tutorial, we’ll set up a secure gateway for access a sample Apache CouchDBTM database. The point of using CouchDB is to verify that the Secure Gateway instance works. You can replace it with any database of your choice to achieve the same results.Docker Engine is a lightweight runtime and packaging tool for apps. Docker works best on Linux OS. If you want to use Docker on Mac or Windows, just install the helper app, Boot2Docker. You’ll find all the details and instructions at https://docs.docker.com/installation/#installation. Just choose your operating system and follow the instructions.Now we’re ready to set up the gateway.Go to the Bluemix site: https://console.ng.bluemix.net/If you’re new to Bluemix, you can sign up for a free trial.Scroll down to Integration and click Secure Gateway.Tip: Most Bluemix services run entirely on the cloud. Secure Gateway is the rare exception to this rule, since its very purpose is to securely connect to on-prem data sources. So, it requires both cloud-platform-side and on-premises processes.On the upper right of the screen, click the APP dropdown and choose Leave unbound.Note: If you haven’t yet installed the Docker client, you must go do so now (see previous section).Enter any name you want for the gateway.Under How would you like to connect this gateway? choose Docker.Copy the text and, if you’re on Mac or Windows, add additional text:If you’re on Linux, this command works fine as-is. But for Mac and Windows, you need to insert the following additional text, right after docker runInsert spaces on either side. The beginning of the line should look like this:Go to your computer’s command line, paste in the text, and press Enter.Your gateway client is now connected to Bluemix.Connected! If you go back and open the gateway in Bluemix,status in the upper right corner shows as Connected.Leave your terminal command line window open. You’ll return to it in a few minutes.Next, we must set the data source endpoint. This will be the on-premises source database we want to share out to the cloud. For the purposes of this tutorial, we’ll use a simple CouchDB database.On your on-prem laptop or computer, install CouchDB.Return to Bluemix and open your open the gateway. Under Create Destinations Enter a name for the connection. Then enter the IP address and port of the on-prem machine where your couchDB database resides and click the +plus button on the far right of the line (use 127.0.0.1 if CouchDB is installed on the current laptop)If you're on Windows or Mac, configure Boot2Docker to provide access to the data.On Windows and Mac, you must allow access through multiple containers. To do so, open a new instance of Boot2Docker and run the following command--inserting your own IP and port information. (If couchDB is running on your local laptop, you can use 127.0.0.1 for the host and 5984 for the port, which are the default settings.)Now you'll see some results. Follow these steps to view your local couchDB data from outside your network.On a laptop or machine outside your on-premise network, open a browser and sign in to Bluemix.Locate the secure gateway connection you created and click its i information button.Open another browser window and paste the string into the address bar. 
At the end of the string, type /_utils so the address looks like this:You'll see your couchDB dashboard (Futon app) appear. That's it! Your database is now accessible from outside your on-premises network!You saw it happen, and so did Bluemix. In Bluemix, return to or open the gateway. The chart shows a spike in traffic.Now you know how create a secure gateway that opens your on-prem data to the cloud. You can try these same steps with MYSQL, DB2, MongoDB, or any other databases you use on-premises.There are 2 types of security to consider. You can:* Require a security token when starting the gateway client. This is useful if you want to control who can start the gateway client. To do so, when you add the gateway, turn on the Enforce Security Token on Client checkbox.Once you do, you see the security token in Gateway details (beside the key icon) for use when starting the gateway on the client:* (Advanced) Extend TLS encryption between the gateway client and your on-prem data source. To implement, click the Enable client TLS checkbox located in the Advanced section of the destination configuration. Optionally, you can upload a certificate file (.pem extension). Note: You do not have to do this step if the certificate is self-signed....for additional parts of this tutorial which will show you how to build an app that leverages the secure gateway. After that, we'll learn how to include data sets from multiple sources (cloud-based and local) for combination and analysis.© ""Apache"", ""CouchDB"", ""Apache CouchDB"" and the CouchDB logo are trademarks or registered trademarks of The Apache Software Foundation. All other brands and trademarks are the property of their respective owners.","See how easy it is to unlock your data for use in mobile and web applications, or for more flexible analysis and reporting. Bluemix Secure Gateway service lets you move data from on-premises to the cloud in a secure manner. ",ibm-cds-labs/hybrid-cloud-tutorial,Live,149 392,"Compose The Compose logo Articles Sign in Free 30-day trialCAMPUS DISCOUNTS - MAKING THE MOST OF COMPOSE Published May 3, 2017 case study mongodb elasticsearch Campus Discounts - Making the Most of ComposeCampus Discounts uses several Compose-hosted databases including MySQL, MongoDB, Redis, Elasticsearch and RabbitMQ to power their social media platform. Recently they started exploring IBM Watson to add cognitive features to the app. We sat down with founder and CTO Don Omondi to hear their story. As a student in Kenya, Don Omondi had difficulty getting information when he was looking to buy a cell phone. “I had to travel more than 22 miles to the nearest town to start window shopping for a phone.” He knew he wasn’t the only one struggling with buying the necessities. “Maybe I could create a platform which will make it easy for students like me to find and buy things easily and from sellers nearby,” Don told us. The opportunity came when IBM held their SmartCamp competition in Nairobi, where Don was among the finalists. Shortly thereafter, he founded Campus Discounts to realize his dream. Campus Discounts is a social network where students find and recommend discounts posted by vendors near their campuses. Businesses create pages and post discounts on the campus site. Students can then view their campus page and find discounts nearby. After a free signup, students can select product categories of interest and also connect to fellow students through the buddy system. 
Students can also flag bargains and notify their friends easily via recommendations which make up a users’ news feed and timelines. Businesses who are interested in listing their offerings can tag up to 3 locations which will make the discount show in all campuses within a default 10 km radius or they can target a wider geography for an extra fee. They can also get analytics like traffic flow, behavioral trends etc., on the same platform. Localization of the platform (language, currency) is done automatically. At the moment, Campus Discounts doesn’t have any peer-to-peer sales model but that’s on their roadmap, including transaction processing and other e-commerce features. Powering the data layer of the platform are six databases. MySQL is used for ‘primary’ data such as users, discounts, business pages, apps, and sessions. Redis is used to cache this data for redundancy. MongoDB is used for storing ‘secondary’ data. This data is derived from actions on primary data such as likes, comments, follows, ratings, reviews, friendships, etc. Don likes MongoDB because “We can store and retrieve all these little pieces of data easily - they don’t have to be related with one another.” The majority of the site’s user-centric features are handled by Elasticsearch. “The reason we use Elasticsearch is for its power of geographic qualities, scoring and sorting of data and flexible search capabilities,” said Don. Discounts have a geo shape field mapping while campuses have a geo point field mapping which, for example, allows them to do a query in Elasticsesarch for any discounts belonging to specific categories, with at least 5 likes, that have the word ‘Samsung’ in their description and within a given radius (e.g., 10 kilometers). “Elasticsearch makes these kinds of queries very easy to implement.” The fifth database Campus Discounts uses is JanusGraph (currently not hosted on Compose). It’s a highly scalable graph database which is originally a fork of the popular open source project Titan. Don uses this for graphing relationships between registered and non-registered users for social invites as well as to suggest new friendships on the platform based on their interests, what businesses they are following, and so forth. This also makes it easier for businesses to provide targeted discounts to student segments. Finally, Campus Discounts uses Compose for RabbitMQ, a popular message broker to synchronize, track, route, and queue tasks that need to be processed later. “All our secondary data is persisted asynchronously, developers who tap into our API can activate webhooks to know when it’s done” Running on top of the databases is PHP Symfony (a collection of reusable PHP components) for the backend and Ember.js plus Node.js on the front end. Why PHP? According to Don, “A lot of people hate on PHP, but it has an unrivaled community and library support which can be priceless for certain use cases. For example, Symfony comes with Doctrine, a mature data persistence library for ORM, ODM and Cache as well as ways to integrate them all.” Don has shared his experiences working with multiple databases and application stack in several articles under Compose’s WriteStuff banner. You can find them here . Recently, Don started experimenting with IBM Watson to embed its cognitive abilities into the Campus Discounts platform. One idea that he loves and has already implemented is matching real world items with discounts posted on the platform. 
As he explains, “Wouldn't it be cool if you see this dress or bike that you really like, you take a picture, upload it to Campus Discounts, and we find nearby offers that match that image?” With Watson, he can now do that. He has also expanded this into a chatbot and voice-command-like feature. As he explains, “Using the HTML5 audio API, you can talk to our Watson Bot to find specific offers, or even to log out.” Don reckons that since Compose databases are already available on IBM Bluemix and Watson Data Platform, it's secure, easy and performant to blend the two: “Watson makes sense of it then Elasticsearch finds it”. So, why Compose? “First and foremost, as CTO of a growing startup, I have a lot on my plate right now. Compose really comes in and takes the weight off my shoulders. I can focus on developing my code; I don't need to worry about installing databases, keeping them up to date, keeping my platform live and keeping it secure. And when I do need help, I can rely on your support team for a prompt response.” Don likes the fact that as a cloud platform, Compose allows him to host databases in many locations and with different service providers like Google Cloud Platform, Amazon AWS, and IBM SoftLayer. He also likes to play around and try out new things: “Compose recently introduced ScyllaDB, which is a faster, Cassandra replacement database. So, if I wanted to test it out, I just need to spin it up with a click and try it in minutes.” With hard work and Compose's help, Campus Discounts has seen rapid growth since its inception in 2015. It's now available in over 36,500 campuses worldwide. To learn more about Campus Discounts, visit: https://campus-discounts.com/ . Arick Disilva works in Product Marketing at Compose.","Campus Discounts uses several Compose-hosted databases including MySQL, MongoDB, Redis, Elasticsearch and RabbitMQ to power their social media platform. Recently they started exploring IBM Watson to add cognitive features to the app. 
We sat down with founder and CTO Don Omondi to hear their story.",Campus Discounts - Making the Most of Compose (customer),Live,150 395,"Karlijn Willems, Data Science Journalist @DataCamp, Nov 16 -------------------------------------------------------------------------------- JUPYTER NOTEBOOK TUTORIAL: THE DEFINITIVE GUIDE Originally published at https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook Data science is about learning by doing. One of the ways you can learn how to do data science is by building your own portfolio: elaborating your own pet project, doing a quick data exploration task, participating in a data challenge, reporting on your research or advancements you have made in learning data science, creating an Extract, Transform, and Load (ETL) flow of data, … This way, you exercise the practical skills you will need when you work as a data scientist. As a web application in which you can create and share documents that contain live code, equations, visualizations as well as text, the Jupyter Notebook is one of the ideal tools to help you to gain the data science skills you need. This tutorial will cover the following topics: * A basic overview of the Jupyter Notebook App and its components, * The history of the Jupyter Project to show how it’s connected to IPython, * An overview of the three most popular ways to run your notebooks: with the help of a Python distribution, with pip or in a Docker container, * A practical introduction to the components that were covered in the first section, complete with an explanation on how to make your notebook documents magical and answers to frequently asked questions, such as “How to toggle between Python 2 and 3?”, and * The best practices and tips that will help you to make your notebook an added value to any data science project! The Jupyter Notebook: an interactive data science environment -------------------------------------------------------------------------------- WHAT IS A JUPYTER NOTEBOOK? In this case, “notebook” or “notebook documents” denote documents that contain both code and rich text elements, such as figures, links, equations, … Because of the mix of code and text elements, these documents are the ideal place to bring together an analysis description and its results, and they can be executed to perform the data analysis in real time. These documents are produced by the Jupyter Notebook App. We’ll talk about this in a bit. For now, you should just know that “Jupyter” is a loose acronym meaning Julia, Python, and R. These programming languages were the first target languages of the Jupyter application, but nowadays, the notebook technology also supports many other languages. And there you have it: the Jupyter Notebook. As you just saw, the main components of the whole environment are, on the one hand, the notebooks themselves and the application. On the other hand, you also have a notebook kernel and a notebook dashboard. Let’s look at these components in more detail. WHAT IS THE JUPYTER NOTEBOOK APP? As a client-server application, the Jupyter Notebook App allows you to edit and run your notebooks via a web browser. The application can be executed on a PC without Internet access or it can be installed on a remote server, where you can access it through the Internet.
Its two main components are the kernels and a dashboard. A kernel is a program that runs and introspects the user’s code. The Jupyter Notebook App has a kernel for Python code, but there are also kernels available for other programming languages. The dashboard of the application not only shows you the notebook documents that you have made and can reopen but can also be used to manage the kernels: you can check which ones are running and shut them down if necessary. THE HISTORY OF IPYTHON AND JUPYTER NOTEBOOKS To fully understand what the Jupyter Notebook is and what functionality it has to offer, you need to know how it originated. Let’s back up briefly to the late 1980s. Guido van Rossum begins to work on Python at the National Research Institute for Mathematics and Computer Science in the Netherlands. Wait, maybe that’s too far. Let’s go to late 2001, twenty years later. Fernando Pérez starts developing IPython. In 2005, both Robert Kern and Fernando Pérez attempted building a notebook system. Unfortunately, the prototype never became fully usable. Fast forward two years: the IPython team had kept on working, and in 2007, they made another attempt at implementing a notebook-type system. By October 2010, there was a prototype of a web notebook, and in the summer of 2011 this prototype was incorporated into IPython and released with version 0.12 on December 21, 2011. In subsequent years, the team received awards, such as the Award for the Advancement of Free Software for Fernando Pérez on March 23, 2013 and the Jolt Productivity Award, and funding from the Alfred P. Sloan Foundation, among others. Lastly, in 2014, Project Jupyter started as a spin-off project from IPython. IPython is now the name of the Python backend, which is also known as the kernel. Recently, the next generation of Jupyter Notebooks has been introduced to the community. It’s called JupyterLab. Read more about it here . After all this, you might wonder where this idea of notebooks originated or how it came to its creators. Go here to find out more. HOW TO INSTALL JUPYTER NOTEBOOK RUNNING JUPYTER NOTEBOOKS WITH THE ANACONDA PYTHON DISTRIBUTION One of the requirements here is Python, either Python 3.3 or greater or Python 2.7. The general recommendation is that you use the Anaconda distribution to install both Python and the notebook application. The advantage of Anaconda is that you have access to over 720 packages that can easily be installed with Anaconda’s conda, a package, dependency, and environment manager. You can download and follow the instructions for the installation of Anaconda here . Is something not clear? You can always read up on the Jupyter installation instructions here . RUNNING JUPYTER NOTEBOOK THE PYTHONIC WAY: PIP If you don’t want to install Anaconda, you just have to make sure that you have the latest version of pip. If you have installed Python, you will normally already have it. What you do need to do is upgrade pip, and once you have an up-to-date pip, you can get started on installing Jupyter. Go to the original article for the commands to install Jupyter via pip. RUNNING JUPYTER NOTEBOOKS IN DOCKER CONTAINERS Docker is an excellent platform to run software in containers. These containers are self-contained and isolated processes. This sounds a bit like a virtual machine, right? Not really. Go here to read an explanation on why they are different, complete with a fantastic house metaphor.
Running Jupyter in Docker ContainersYou can easily get started with Docker: turn to the original article to get started with Jupyter on Docker. HOW TO USE JUPYTER NOTEBOOKS Now that you know what you’ll be working with and you have installed it, it’s time to get started for real! GETTING STARTED WITH JUPYTER NOTEBOOKS Run the following command to open up the application: jupyter notebook Then you’ll see the application opening in the web browser on the following address: http://localhost:8888. For a complete overview of all the components of the Jupyter Notebook, complete with gifs, go to the original article . If you want to start on your notebook, go back to the main menu and click the “Python 3” option in the “Notebook” category. You will immediately see the notebook name, a menu bar, a toolbar and an empty code cell. You can immediately start with importing the necessary libraries for your code. This is one of the best practices that we will discuss in more detail later on. After, you can add, remove or edit the cells according to your needs. And don’t forget to insert explanatory text or titles and subtitles to clarify your code! That’s what makes a notebook a notebook in the end. For more tips, go here . Are you not sure what a whole notebook looks like? Hop over to the last section to discover the best ones out there! TOGGLING BETWEEN PYTHON 2 AND 3 IN JUPYTER NOTEBOOKS Up until now, working with notebooks has been quite straightforward. But what if you don’t just want to use Python 3 or 2? What if you want to change between the two? Luckily, the kernels can solve this problem for you! You can easily create a new conda environment to use different notebook kernels. Then you restart the application and the two kernels should be available to you. Very important: don’t forget to (de)activate the kernel you (don’t) need. Go to the original article to see how this works and how you can manually register your kernels. RUNNING R IN YOUR JUPYTER NOTEBOOK As the explanation of the kernels in the first section already suggested, you can also run other languages besides Python in your notebook! If you want to use R with Jupyter Notebooks but without running it inside a Docker container, you can run the following command to install the R essentials in your current environment. These “essentials” include the packages dplyr , shiny , ggplot2 , tidyr , caret and nnet . If you don't want to install the essentials in your current environment, you can use the following command to create a new environment just for the R essentials. Next, open up the notebook application to start working with R with the usual command. If you want to know about the commands to execute or extra tips to run R successfully in your Jupyter Notebook, go here . If you now want to install additional R packages to elaborate your data science project, you can either build a Conda R package or you can install the package from inside of R via install.packages or devtools::install_github (from GitHub). You just have to make sure to add new package to the correct R library used by Jupyter. Note that you can also install the IRKernel, a kernel for R, to work with R in your notebook. You can follow the installation instructions here . Note that you also have kernels to run languages such as Julia, SAS, … in your notebook. Go here for a complete list of the kernels that are available. This list also contains links to the respective pages that have installation instructions to get you started. 
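To make the kernel setup above a little more concrete, here is a small sketch (not taken from the original tutorial) of registering an extra kernel and then confirming what the notebook server can see. The environment name py27 and the display name are only examples, and the commented commands assume conda and ipykernel are available.

# In a terminal (or prefixed with ! in a notebook cell), create an environment and
# register it as a named kernel. Names here are examples only:
#   conda create -n py27 python=2.7 ipykernel
#   source activate py27
#   python -m ipykernel install --user --name py27 --display-name "Python 2.7"

# Back in Python, list the kernel specs Jupyter currently knows about,
# the programmatic equivalent of `jupyter kernelspec list`:
from jupyter_client.kernelspec import KernelSpecManager

for name, resource_dir in KernelSpecManager().find_kernel_specs().items():
    print(name, "->", resource_dir)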
Making your Jupyter Notebook Magical With Magic Commands MAKING YOUR JUPYTER NOTEBOOK MAGICAL If you want to get the most out of this, you should consider learning about the so-called “magic commands”. Also consider adding even more interactivity to your notebook so that it becomes an interactive dashboard for others. The Notebook’s Built-In Commands There are some predefined ‘magic functions’ that will make your work a lot more interactive. To see which magic commands you have available in your interpreter, you can simply run the following: %lsmagic And you’ll see a whole bunch of them appearing. You’ll probably see some magic commands that you’ll grasp, such as %save , %clear or %debug , but others will be less straightforward. If you’re looking for more information on the magic commands or on functions, you can always use the ? operator. Note that there is a difference between using % (line magics) and %% (cell magics). To know more about this and other useful magic commands that you can use, go here . You can also use magics to mix languages in your notebook without setting up extra kernels: there is rmagic to run R code, SQL for RDBMS or Relational Database Management System access and cythonmagic for interactive work with Cython, among others. But there is so much more here ! Interactive Notebooks As Dashboards: Widgets The magic commands already do a lot to make your workflow with notebooks agreeable, but you can also take additional steps to make your notebook an interactive place for others by adding widgets to it! This example was taken from a wonderful tutorial on building interactive dashboards in Jupyter, which you can find on this page . SHARE YOUR JUPYTER NOTEBOOKS In practice, you might want to share your notebooks with colleagues or friends to show them what you have been up to or as a data science portfolio for future employers. However, the notebook documents are JSON documents that contain text, source code, rich media output, and metadata. Each segment of the document is stored in a cell. Ideally, you don’t want to go around and share JSON files. That’s why you want to find and use other ways to share your notebook documents with others. When you create a notebook, you will see a button in the menu bar that says “File”. When you click this, you see that Jupyter gives you the option to download your notebook as an HTML, PDF, Markdown or reStructuredText, or a Python script or a Notebook file. You can use the nbconvert command to convert your notebook document file to another static format, such as HTML, PDF, LaTeX, Markdown, reStructuredText, ... But don't forget to install nbconvert first if you don't have it yet! Then, you can run something like the following command to convert your notebooks: jupyter nbconvert --to html Untitled4.ipynb With nbconvert , you can execute an entire notebook non-interactively, saving it in place or to a variety of other formats. The fact that you can do this makes notebooks a powerful tool for ETL and for reporting. For reporting, you just make sure to schedule a run of the notebook every so many days, weeks or months; for an ETL pipeline, you can make use of the magic commands in your notebook in combination with some type of scheduling. Besides these options, you could also consider the following options . JUPYTER NOTEBOOKS IN PRACTICE This all is very interesting when you’re working alone on a data science project. But most times, you’re not alone.
You might have some friends look at your code or you’ll need your colleagues to contribute to your notebook. How should you actually use these notebooks in practice when you’re working in a team? The following tips will help you to effectively and efficiently use notebooks on your data science project. TIPS TO EFFECTIVELY AND EFFICIENTLY USE YOUR JUPYTER NOTEBOOKS Using these notebooks doesn’t mean that you don’t need to follow the coding practices that you would usually apply. You probably already know the drill, but these principles include the following: * Try to provide comments and documentation to your code. They might be a great help to others! * Also consider a consistent naming scheme, code grouping, limit your line length, … * Don’t be afraid to refactor when or if necessary In addition to these general best practices for programming, you could also consider the following tips to make your notebooks the best source for other users to learn: * Don’t forget to name your notebook documents! * Try to keep the cells of your notebook simple: don’t exceed the width of your cell and make sure that you don’t put too many related functions in one cell. * If possible, import your packages in the first code cell of your notebook, and * [More tips here ] JUPYTER NOTEBOOKS FOR DATA SCIENCE TEAMS: BEST PRACTICES Jonathan Whitmore wrote in his article some practices for using notebooks for data science and specifically addresses the fact that working with the notebook on data science problems in a team can prove to be quite a challenge. That is why Jonathan suggests some best practices: * Use two types of notebooks for a data science project, namely, a lab notebook and a deliverable notebook. The difference between the two (besides the obvious that you can infer from the names that are given to the notebooks) is the fact that individuals control the lab notebook, while the deliverable notebook is controlled by the whole data science team, * Use some type of versioning control (Git, Github, …). Don’t forget to commit also the HTML file if your version control system lacks rendering capabilities, and * Use explicit rules on the naming of your documents. LEARN FROM THE BEST NOTEBOOKS This section is meant to give you a short list with some of the best notebooks that are out there so that you can get started on learning from these examples . You will find that many people regularly compose and have composed lists with interesting notebooks. Don’t miss this gallery of interesting IPython notebooks or this KD Nuggets article. -------------------------------------------------------------------------------- Originally published at www.datacamp.com . Data Science Python Data Mining Machine Learning R 3 Blocked Unblock Follow FollowingKARLIJN WILLEMS Data Science Journalist @DataCamp","Data science is about learning by doing. One of the ways you can learn how to do data science is by building your own portfolio: elaborating your own pet project, doing a quick data exploration task…",Jupyter Notebook Tutorial,Live,151 396,"Homepage Follow Sign in Get started * Home * About Insight * Data Science * Data Engineering * Health Data * AI * Emmanuel Ameisen Blocked Unblock Follow Following Program Director at Insight AI @EmmanuelAmeisen Jan 24 -------------------------------------------------------------------------------- HOW TO SOLVE 90% OF NLP PROBLEMS: A STEP-BY-STEP GUIDE USING MACHINE LEARNING TO UNDERSTAND AND LEVERAGE TEXT. 
How you can apply the 5 W’s and H to Text Data!TEXT DATA IS EVERYWHERE Whether you are an established company or working to launch a new service, you can always leverage text data to validate, improve, and expand the functionalities of your product. The science of extracting meaning and learning from text data is an active topic of research called Natural Language Processing (NLP). NLP produces new and exciting results on a daily basis, and is a very large field. However, having worked with hundreds of companies, the Insight team has seen a few key practical applications come up much more frequently than any other: * Identifying different cohorts of users/customers (e.g. predicting churn, lifetime value, product preferences) * Accurately detecting and extracting different categories of feedback (positive and negative reviews/opinions, mentions of particular attributes such as clothing size/fit…) * Classifying text according to intent (e.g. request for basic help, urgent problem) While many NLP papers and tutorials exist online, we have found it hard to find guidelines and tips on how to approach these problems efficiently from the ground up. HOW THIS ARTICLE CAN HELP After leading hundreds of projects a year and gaining advice from top teams all over the United States, we wrote this post to explain how to build Machine Learning solutions to solve problems like the ones mentioned above. We’ll begin with the simplest method that could work, and then move on to more nuanced solutions, such as feature engineering, word vectors, and deep learning. After reading this article, you’ll know how to: * Gather, prepare and inspect data * Build simple models to start, and transition to deep learning if necessary * Interpret and understand your models, to make sure you are actually capturing information and not noise We wrote this post as a step-by-step guide; it can also serve as a high level overview of highly effective standard approaches. -------------------------------------------------------------------------------- This post is accompanied by an interactive notebook demonstrating and applying all these techniques. Feel free to run the code and follow along! STEP 1: GATHER YOUR DATA EXAMPLE DATA SOURCES Every Machine Learning problem starts with data, such as a list of emails, posts, or tweets. Common sources of textual information include: * Product reviews (on Amazon, Yelp, and various App Stores) * User-generated content (Tweets, Facebook posts, StackOverflow questions) * Troubleshooting (customer requests, support tickets, chat logs) “Disasters on Social Media” dataset For this post, we will use a dataset generously provided by CrowdFlower , called “Disasters on Social Media”, where: Contributors looked at over 10,000 tweets culled with a variety of searches like “ablaze”, “quarantine”, and “pandemonium”, then noted whether the tweet referred to a disaster event (as opposed to a joke with the word or a movie review or something non-disastrous).Our task will be to detect which tweets are about a disastrous event as opposed to an irrelevant topic such as a movie. Why? A potential application would be to exclusively notify law enforcement officials about urgent emergencies while ignoring reviews of the most recent Adam Sandler film. A particular challenge with this task is that both classes contain the same search terms used to find the tweets, so we will have to use subtler differences to distinguish between them. 
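As a concrete starting point, loading and labeling the dataset might look roughly like the sketch below. This is an illustration rather than the post's companion notebook, and the file name and column names are assumptions about the CrowdFlower export.

import pandas as pd

# File and column names below are assumptions for illustration.
tweets = pd.read_csv("socialmedia_disaster_tweets.csv", encoding="latin-1")

# Keep the tweet text and the human label, and map the label to 0/1.
tweets = tweets[["text", "choose_one"]].dropna()
tweets["disaster"] = (tweets["choose_one"] == "Relevant").astype(int)

print(tweets["disaster"].value_counts())   # roughly how many tweets fall in each class
print(tweets.sample(3))                    # eyeball a few raw tweets before cleaning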
In the rest of this post, we will refer to tweets that are about disasters as “ disaster ”, and tweets about anything else as “ irrelevant ”. LABELS We have labeled data and so we know which tweets belong to which categories. As Richard Socher outlines below, it is usually faster, simpler, and cheaper to find and label enough data to train a model on, rather than trying to optimize a complex unsupervised method. Richard Socher’s pro-tipSTEP 2: CLEAN YOUR DATA The number one rule we follow is: “Your model will only ever be as good as your data.”One of the key skills of a data scientist is knowing whether the next step should be working on the model or the data. A good rule of thumb is to look at the data first and then clean it up. A clean dataset will allow a model to learn meaningful features and not overfit on irrelevant noise. Here is a checklist to use to clean your data: (see the code for more details): 1. Remove all irrelevant characters such as any non alphanumeric characters 2. Tokenize your text by separating it into individual words 3. Remove words that are not relevant, such as “@” twitter mentions or urls 4. Convert all characters to lowercase, in order to treat words such as “hello”, “Hello”, and “HELLO” the same 5. Consider combining misspelled or alternately spelled words to a single representation (e.g. “cool”/”kewl”/”cooool”) 6. Consider lemmatization (reduce words such as “am”, “are”, and “is” to a common form such as “be”) After following these steps and checking for additional errors, we can start using the clean, labelled data to train models! STEP 3: FIND A GOOD DATA REPRESENTATION Machine Learning models take numerical values as input. Models working on images, for example, take in a matrix representing the intensity of each pixel in each color channel. A smiling face represented as a matrix of numbers.Our dataset is a list of sentences, so in order for our algorithm to extract patterns from the data, we first need to find a way to represent it in a way that our algorithm can understand, i.e. as a list of numbers. ONE-HOT ENCODING (BAG OF WORDS) A natural way to represent text for computers is to encode each character individually as a number ( ASCII for example). If we were to feed this simple representation into a classifier, it would have to learn the structure of words from scratch based only on our data, which is impossible for most datasets. We need to use a higher level approach. For example, we can build a vocabulary of all the unique words in our dataset, and associate a unique index to each word in the vocabulary. Each sentence is then represented as a list that is as long as the number of distinct words in our vocabulary. At each index in this list, we mark how many times the given word appears in our sentence. This is called a Bag of Words model , since it is a representation that completely ignores the order of words in our sentence. This is illustrated below. Representing sentences as a Bag of Words. Sentences on the left, representation on the right. Each index in the vectors represent one particular word.VISUALIZING THE EMBEDDINGS We have around 20,000 words in our vocabulary in the “Disasters of Social Media” example, which means that every sentence will be represented as a vector of length 20,000. The vector will contain mostly 0s because each sentence contains only a very small subset of our vocabulary. In order to see whether our embeddings are capturing information that is relevant to our problem (i.e. 
whether the tweets are about disasters or not), it is a good idea to visualize them and see if the classes look well separated. Since vocabularies are usually very large and visualizing data in 20,000 dimensions is impossible, techniques like PCA will help project the data down to two dimensions. This is plotted below. Visualizing Bag of Words embeddings. The two classes do not look very well separated, which could be a feature of our embeddings or simply of our dimensionality reduction. In order to see whether the Bag of Words features are of any use, we can train a classifier based on them. STEP 4: CLASSIFICATION When first approaching a problem, a general best practice is to start with the simplest tool that could solve the job. When it comes to classifying data, a common favorite for its versatility and explainability is Logistic Regression . It is very simple to train and the results are interpretable as you can easily extract the most important coefficients from the model. We split our data into a training set used to fit our model and a test set to see how well it generalizes to unseen data. After training, we get an accuracy of 75.4%. Not too shabby! Guessing the most frequent class (“irrelevant”) would give us only 57%. However, even if 75% accuracy was good enough for our needs, we should never ship a model without trying to understand it. STEP 5: INSPECTION CONFUSION MATRIX A first step is to understand the types of errors our model makes, and which kind of errors are least desirable. In our example, false positives are classifying an irrelevant tweet as a disaster, and false negatives are classifying a disaster as an irrelevant tweet. If the priority is to react to every potential event, we would want to lower our false negatives. If we are constrained in resources however, we might prioritize a lower false positive rate to reduce false alarms. A good way to visualize this information is using a Confusion Matrix , which compares the predictions our model makes with the true label. Ideally, the matrix would be a diagonal line from top left to bottom right (our predictions match the truth perfectly). Confusion Matrix (Green is a high proportion, blue is low) Our classifier creates more false negatives than false positives (proportionally). In other words, our model’s most common error is inaccurately classifying disasters as irrelevant. If false positives represent a high cost for law enforcement, this could be a good bias for our classifier to have. EXPLAINING AND INTERPRETING OUR MODEL To validate our model and interpret its predictions, it is important to look at which words it is using to make decisions. If our data is biased, our classifier will make accurate predictions in the sample data, but the model would not generalize well in the real world. Here we plot the most important words for both the disaster and irrelevant class. Plotting word importance is simple with Bag of Words and Logistic Regression, since we can just extract and rank the coefficients that the model used for its predictions. Bag of Words: Word importance Our classifier correctly picks up on some patterns (hiroshima, massacre), but clearly seems to be overfitting on some meaningless terms (heyoo, x1392). Right now, our Bag of Words model is dealing with a huge vocabulary of different words and treating all words equally . However, some of these words are very frequent, and are only contributing noise to our predictions.
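Before moving on, here is a compact sketch of the pipeline so far: cleaning, Bag of Words, Logistic Regression, and a confusion matrix. It is an illustration with scikit-learn, reusing the tweets dataframe from the loading sketch above, not the exact code from the accompanying notebook.

import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

def clean_text(text):
    """Minimal version of the checklist above: strip URLs and mentions, drop odd characters, lowercase."""
    text = re.sub(r"http\S+|@\S+", " ", text)
    text = re.sub(r"[^a-zA-Z0-9 ]", " ", text)
    return text.lower()

corpus = tweets["text"].map(clean_text)    # `tweets` as loaded in the earlier sketch
labels = tweets["disaster"]

X_train, X_test, y_train, y_test = train_test_split(corpus, labels, test_size=0.2, random_state=40)

vectorizer = CountVectorizer()             # Bag of Words: one column per vocabulary word
bow_train = vectorizer.fit_transform(X_train)
bow_test = vectorizer.transform(X_test)

clf = LogisticRegression()
clf.fit(bow_train, y_train)
predictions = clf.predict(bow_test)

print("accuracy:", accuracy_score(y_test, predictions))
print(confusion_matrix(y_test, predictions))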
Next, we will try a way to represent sentences that can account for the frequency of words, to see if we can pick up more signal from our data. STEP 6: ACCOUNTING FOR VOCABULARY STRUCTURE TF-IDF In order to help our model focus more on meaningful words, we can use a TF-IDF score (Term Frequency, Inverse Document Frequency) on top of our Bag of Words model. TF-IDF weighs words by how rare they are in our dataset, discounting words that are too frequent and just add to the noise. Here is the PCA projection of our new embeddings. Visualizing TF-IDF embeddings.We can see above that there is a clearer distinction between the two colors. This should make it easier for our classifier to separate both groups. Let’s see if this leads to better performance. Training another Logistic Regression on our new embeddings, we get an accuracy of 76.2%. A very slight improvement. Has our model has started picking up on more important words? If we are getting a better result while preventing our model from “cheating” then we can truly consider this model an upgrade. TF-IDF: Word importanceThe words it picked up look much more relevant! Although our metrics on our test set only increased slightly, we have much more confidence in the terms our model is using, and thus would feel more comfortable deploying it in a system that would interact with customers. STEP 7: LEVERAGING SEMANTICS WORD2VEC Our latest model managed to pick up on high signal words. However, it is very likely that if we deploy this model, we will encounter words that we have not seen in our training set before. The previous model will not be able to accurately classify these tweets, even if it has seen very similar words during training . To solve this problem, we need to capture the semantic meaning of words , meaning we need to understand that words like ‘good’ and ‘positive’ are closer than ‘apricot’ and ‘continent.’ The tool we will use to help us capture meaning is called Word2Vec. Using pre-trained words Word2Vec is a technique to find continuous embeddings for words. It learns from reading massive amounts of text and memorizing which words tend to appear in similar contexts. After being trained on enough data, it generates a 300-dimension vector for each word in a vocabulary, with words of similar meaning being closer to each other. The authors of the paper open sourced a model that was pre-trained on a very large corpus which we can leverage to include some knowledge of semantic meaning into our model. The pre-trained vectors can be found in the repository associated with this post. SENTENCE LEVEL REPRESENTATION A quick way to get a sentence embedding for our classifier is to average Word2Vec scores of all words in our sentence. This is a Bag of Words approach just like before, but this time we only lose the syntax of our sentence, while keeping some semantic information. Word2Vec sentence embeddingHere is a visualization of our new embeddings using previous techniques: Visualizing Word2Vec embeddings.The two groups of colors look even more separated here, our new embeddings should help our classifier find the separation between both classes. After training the same model a third time (a Logistic Regression), we get an accuracy score of 77.7% , our best result yet! Time to inspect our model. THE COMPLEXITY/EXPLAINABILITY TRADE-OFF Since our embeddings are not represented as a vector with one dimension per word as in our previous models, it’s harder to see which words are the most relevant to our classification. 
While we still have access to the coefficients of our Logistic Regression, they relate to the 300 dimensions of our embeddings rather than the indices of words. For such a low gain in accuracy, losing all explainability seems like a harsh trade-off. However, with more complex models we can leverage black box explainers such as LIME in order to get some insight into how our classifier works. LIME LIME is available on Github through an open-sourced package. A black-box explainer allows users to explain the decisions of any classifier on one particular example by perturbing the input (in our case removing words from the sentence) and seeing how the prediction changes. Let’s see a couple explanations for sentences from our dataset. Correct disaster words are picked up to classify as “relevant”. Here, the contribution of the words to the classification seems less obvious. However, we do not have time to explore the thousands of examples in our dataset. What we’ll do instead is run LIME on a representative sample of test cases and see which words keep coming up as strong contributors. Using this approach we can get word importance scores like we had for previous models and validate our model’s predictions. Word2Vec: Word importance Looks like the model picks up highly relevant words, implying that it appears to make understandable decisions. These seem like the most relevant words out of all previous models and therefore we’re more comfortable deploying it into production. STEP 8: LEVERAGING SYNTAX USING END-TO-END APPROACHES We’ve covered quick and efficient approaches to generate compact sentence embeddings. However, by omitting the order of words, we are discarding all of the syntactic information of our sentences. If these methods do not provide sufficient results, you can utilize more complex models that take in whole sentences as input and predict labels without the need to build an intermediate representation. A common way to do that is to treat a sentence as a sequence of individual word vectors using either Word2Vec or more recent approaches such as GloVe or CoVe . This is what we will do below. A highly effective end-to-end architecture ( source ) Convolutional Neural Networks for Sentence Classification train very quickly and work well as an entry-level deep learning architecture. While Convolutional Neural Networks (CNN) are mainly known for their performance on image data, they have been providing excellent results on text-related tasks, and are usually much quicker to train than most complex NLP approaches (e.g. LSTMs and Encoder/Decoder architectures). This model preserves the order of words and learns valuable information on which sequences of words are predictive of our target classes. Contrary to previous models, it can tell the difference between “Alex eats plants” and “Plants eat Alex.” Training this model does not require much more work than previous approaches (see code for details) and gives us a model that is much better than the previous ones, getting 79.5% accuracy ! As with the models above, the next step should be to explore and explain the predictions using the methods we described to validate that it is indeed the best model to deploy to users. By now, you should feel comfortable tackling this on your own.
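For the inspection step just described, here is a minimal sketch of pointing LIME at a text classifier. For brevity it reuses the simple vectorizer-plus-classifier pipeline from the earlier sketch rather than the Word2Vec or CNN models, but the same pattern applies to any model that can produce class probabilities from raw text.

from lime.lime_text import LimeTextExplainer
from sklearn.pipeline import make_pipeline

# Wrap the vectorizer and classifier from the earlier sketch so LIME can go
# straight from raw text to class probabilities.
pipeline = make_pipeline(vectorizer, clf)

explainer = LimeTextExplainer(class_names=["irrelevant", "disaster"])
example = X_test.iloc[0]

# LIME perturbs the sentence (removing words) and fits a local explanation.
explanation = explainer.explain_instance(example, pipeline.predict_proba, num_features=6)
print(example)
print(explanation.as_list())   # (word, weight) pairs: which words pushed the prediction, and how much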
FINAL NOTES Here is a quick recap of the approach we’ve successfully used: * Start with a quick and simple model * Explain its predictions * Understand the kind of mistakes it is making * Use that knowledge to inform your next step, whether that is working on your data, or a more complex model. These approaches were applied to a particular example case using models tailored towards understanding and leveraging short text such as tweets, but the ideas are widely applicable to a variety of problems . I hope this helped you, we’d love to hear your comments and questions! Feel free to comment below or reach out to @EmmanuelAmeisen here or on Twitter . -------------------------------------------------------------------------------- Want to learn applied Artificial Intelligence from top professionals in Silicon Valley or New York? Learn more about the Artificial Intelligence program. Are you a company working in AI and would like to get involved in the Insight AI Fellows Program? Feel free to get in touch . * Machine Learning * Business * Artificial Intelligence * Tutorial * Insight Ai One clap, two clap, three clap, forty?By clapping more or less, you can signal to us which stories really stand out. 1K Blocked Unblock Follow FollowingEMMANUEL AMEISEN Program Director at Insight AI @EmmanuelAmeisen FollowINSIGHT DATA Insight Fellows Program —Your bridge to careers in Data Science and Data Engineering. * 1K * * * Never miss a story from Insight Data , when you sign up for Medium. Learn more Never miss a story from Insight Data Get updates Get updates","After leading hundreds of projects a year and gaining advice from top teams all over the United States, we wrote this post to explain how to build Machine Learning solutions to solve problems like the ones mentioned above.",How to solve 90% of NLP problems,Live,152 399,"Homepage Follow Sign in Get started Tim Bohn Blocked Unblock Follow Following Sr. Solution Architect, IBM Data Science Elite team. Travel (50+ countries), Pickleball and Technology. Tweets are personal opinions. Dec 13 -------------------------------------------------------------------------------- UNFRIENDLY SKIES: PREDICTING FLIGHT CANCELLATIONS USING WEATHER DATA, PART 3 Tim Bohn and Ricardo Balduino Piarco Airport, Trinidad in the 1950s, Copyright John Hill, Creative Commons Attribution-Share Alike 4.0In Part 1 of this series, we wrote about our goal to explore a use case and use various machine learning platforms to see how we might build classification models with those platforms to predict flight cancellations. Specifically, we hoped to predict the probability of the cancellation of flights between the ten U.S. airports most affected by weather. We used historical flight data and historical weather data to make predictions for upcoming flights. In Part 2 , we started our exploration with IBM SPSS Modeler and APIs from The Weather Company . With this post, we look at IBM’s Data Science Experience (DSX). TOOLS USED IN THIS USE CASE SOLUTION DSX is a collaborative platform for data scientists, built on open-source components and IBM added value, which is available in the cloud or on-premise. In the simplest terms, DSX is a managed Apache Spark cluster with a Notebook front-end. 
By default, it includes integration with data tools like a data catalog and data refinery, Watson Machine Learning services, collaboration capability, Model Management, and the ability to automatically review a model’s performance and refresh/retrain the model with new data — and IBM is quickly adding more capabilities. Read here to see what IBM is doing lately for data science. A PYTHON NOTEBOOK SOLUTION In this case, we followed roughly the same steps we used in the SPSS model from Part 2, only this time we wrote python code in a Jupyter notebook to get similar results. We encourage readers to come up with their own solutions. Let us know. We’d love to feature your approaches in future blog posts. The first step of the iterative process is gathering and understanding the data needed to train and test our model. Since we did this work for part 2, we made use of the analysis here. Flights data — We gathered data for 2016 flights from the US Bureau of Transportation Statistics website. The website allowed us to export one month at a time, so we ended up with twelve csv (comma separated value) files. Importing those as dataframes and merging into a single dataframe was straightforward. Figure 1 — Gathering and preparing flight data in IBM DSXWeather data — With the latitude and longitude of the 10 Most Weather-Delayed U.S. Major Airports , we used one of the Weather Company’s API’s to get the historical hourly weather data for all of 2016 for each of the 10 airport locations and created a csv file that became our data set in the notebook. Combined flights and weather data — To each flight in the first data set, we added two new columns: ORIGIN and DEST, containing the respective airport codes. Next, we merged flight data and the weather data so that the resulting dataframe contained the flight data along with the weather for the corresponding Origin and Destination airports. DATA PREPARATION, MODELING, AND EVALUATION To start preparing the data, we used the combined flights and weather data from the previous step and performed some cleanup. We deleted columns of features that we didn’t need, and replaced null values in rows where flight cancellations were not related to weather conditions. Next, we took the features we discovered when we created a model using SPSS (such as flight date, hour, day of the week, origin and destination airport codes, and weather conditions) and we used them as inputs to our python model. We also chose the target feature for the model to predict: the cancellation status. We deleted the remaining features. Next, we ran OneHotEncode r on the four categorical features. One-hot encoding is a process by which categorical features get converted into a format that works better with certain algorithms, like classification and regression. Figure 2 shows the number of feature columns, expanded significantly with one hot encoding. Figure 2 — One-hot encoding expands 4 feature columns into many moreInterestingly, the flight data is heavily imbalanced. Specifically, as seen in Figure 3, of all the flights in the data set only a small percentage are actually cancelled. Figure 3 — Historical data: distribution of cancelled (1) and non-cancelled (0) flightsTo address that skewedness in the original data, we tried oversampling the minority class, under sampling the majority class, and a combination of both — but none of these approaches worked well. 
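For orientation, parts of the preparation described above (merging the monthly flight files and encoding the categorical features) might be sketched as follows. File and column names are assumptions for illustration, and pandas' get_dummies stands in here for the OneHotEncoder step mentioned in the text; this is not the notebook's exact code.

import glob
import pandas as pd

# Twelve monthly exports from the BTS site, merged into one dataframe.
monthly_files = sorted(glob.glob("flights_2016_*.csv"))
flights = pd.concat((pd.read_csv(f) for f in monthly_files), ignore_index=True)

# One-hot encode the categorical features (column names are assumptions).
categorical = flights[["ORIGIN", "DEST", "DAY_OF_WEEK", "DEP_HOUR"]].astype(str)
features = pd.get_dummies(categorical)          # expands into many 0/1 columns
target = flights["CANCELLED"]

print(features.shape)                           # far more columns than we started with
print(target.value_counts(normalize=True))      # and a heavily imbalanced target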
We then tried something called SMOTE (Synthetic Minority Over-Sampling Technique), an algorithm that provides an advanced over-sampling algorithm to deal with imbalanced datasets. Since it generates synthetic examples rather than just using replication, it helped our selected model work more effectively by mitigating the problem of overfitting that random oversampling can cause. SMOTE isn’t considered effective for high dimensional data, but that isn’t the case here. In Figure 4, we notice a balanced distribution between cancelled and non-cancelled flights after running the data through SMOTE. Figure 4 — Distribution of cancelled and non-cancelled flights after using SMOTEIt’s important to mention is that we applied SMOTE only to the training data set, not the test data set. A detailed blog by Nick Becker guided our choices in the notebook. At this point, we used the Random Forest Classifier for our model. It did the best when we used SPSS so we used again in our notebook. We have several ideas for a second iteration of our model in order to tune it, one of which is to try multiple algorithms to see how they compare. Since this use case deals with classification analysis, we used some of the common ways to evaluate the performance of the model: the confusion matrix, F1 score and ROC curve, among some others. Figures 5 and 6 show the results. Figure 5 — Test/Validation Results Figure 6 — ROC curve for training data setFigure 6 is the ROC curve from the training data set. Figure 5 shows us that the results from the training and test data sets are pretty close, which is a good indication of consistency, though we realize that with some tuning it could get better. Nevertheless, we decided that the results were still good for the purposes of our discussion in this blog, and we stopped our iterations here. We encourage readers to refine the model further or even to use other models to solve this use case. CONCLUSION This was a project to compare creating a model in IBM’s SPSS with IBM’s Data Science Experience . SPSS offers a no-code experience while DSX offers the best of open-source coding capability with many IBM value adds. SPSS is an amazing product and gets better with every release, adding many new capabilities. IBM’s Data Science Experience is a great platform for both the beginning and experienced data scientist. Anyone can log in and have immediate access to a managed Spark cluster with a choice of a Jupyter notebook front-end using Scala, Python or R, SPSS and visual data modeler (no coding). It offers easy collaboration with other users, including adding other data scientists who could then look over our shoulders and make suggestions. The community is active and has already contributed dozens of tutorials, data sets and notebooks . If we had added Watson Machine Learning, we could very easily have deployed and managed our model with an instant REST endpoint to call from any application. If our data was changing, we could have WML review our model periodically and retrain it with any new data if our metric (ROC Curve) value fell below a given threshold. That, along with new data cataloging and data refinery tooling added recently, make this a platform worth checking out for any data science project. SPSS has a lot, but not everything. Writing the python code in a notebook was a bit more time-consuming than what we did in SPSS, but it also gave quite a bit more flexibility and freedom. 
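As an illustration of that flexibility, the resampling and modeling step discussed above could be sketched like this, assuming the imbalanced-learn and scikit-learn packages and the features and target variables from the earlier sketch; it is a sketch, not the authors' notebook.

from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42)

# Balance the *training* data only, as discussed above.
# (Older imbalanced-learn releases call this method fit_sample.)
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_train, y_train)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_resampled, y_resampled)

probs = model.predict_proba(X_test)[:, 1]
print("ROC AUC on the untouched test data:", roc_auc_score(y_test, probs))
print(confusion_matrix(y_test, model.predict(X_test)))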
We had access to everything in the python libraries, and of course, one of the benefits of python as an open-source language is the trove of helpful examples. I would say both platforms have their place, and neither can claim to be better for everything. Those doing data science for the first time will probably find SPSS an easier place to start given its drag-and-drop user interface. Those who have come out of school as programming wizards will want to write code, and DSX will give them a great way to do that without worrying about installing, configuring, and correctly integrating various product versions. RESOURCES The IBM notebook and data that form the basis for this blog are available on Github .","In Part 1 of this series, we wrote about our goal to explore a use case and use various machine learning platforms to see how we might build classification models with those platforms to predict…","Predicting Flight Cancellations Using Weather Data, Part 3",Live,153 404,"USE DASHDB WITH TABLEAU Jess Mantaro / July 17, 2015 Watch how quick and easy it is to perform analytics with dashDB and Tableau. You can also read a transcript of this video. Read the tutorial (PDF)",Watch how quick and easy it is to perform analytics with dashDB and Tableau. ,Use dashDB with Tableau,Live,154 412,"METRICS MAVEN: CALCULATING AN EXPONENTIALLY WEIGHTED MOVING AVERAGE IN POSTGRESQL Published Mar 8, 2017 metrics maven postgresql In our Metrics Maven series, Compose's data scientist shares database features, tips, tricks, and code you can use to get the metrics you need from your data. In this article, we'll walk through how and why to calculate an exponentially weighted moving average. We've covered a few different kinds of averages in this series. We had a look at mean , dug into weighted averages , showed a couple methods for calculating a simple moving average , generated a cumulative moving average in the same article , and also produced a 7-day weighted moving average . In this article we're going to add an exponentially weighted moving average to the group. We'll start by getting a basic understanding of what an exponentially weighted moving average is and why we would want to use it. EXPONENTIALLY WEIGHTED MOVING AVERAGE The exponentially weighted moving average, sometimes also just called exponential moving average, (EWMA or EMA, for short) is used for smoothing trend data like the other moving averages we've reviewed.
Similar to the weighted moving average we covered in our last article, weights are applied to the data such that dates further in the past will receive less weight (and therefore be less impactful to the result) than more recent dates. Rather than decreasing linearly, however, like we saw with the weighted moving average, the weight for an EWMA decreases exponentially for each time period further in the past. Additionally, the result of an EWMA is cumulative because it contains the previously calculated EWMA in its calculation of the current EWMA. Because of this, all the data values have some contribution in the result, though that contribution diminishes as each next period is calculated. An exponentially weighted moving average is often applied when there is a large variance in the trend data, such as for volatile stock prices. It can reduce the noise and help make the trend clearer. Let's get into our example and see how this works. OUR DATA For EWMA, we're going to use the same daily summary data from our hypothetical pet supply company that we used in the previous article on weighted moving averages . Our data table is called ""daily_orders_summary"" and looks like this: date | total_orders | total_order_items | total_order_value | average_order_items | average_order_value -------------------------------------------------------------------------------------------------------------- 2017-01-01 | 14 | 18 | 106.84 | 1.29 | 7.63 2017-01-02 | 10 | 21 | 199.79 | 2.10 | 19.98 2017-01-03 | 12 | 17 | 212.98 | 1.42 | 17.75 2017-01-04 | 12 | 15 | 100.93 | 1.25 | 8.41 2017-01-05 | 10 | 13 | 108.54 | 1.30 | 10.85 2017-01-06 | 14 | 20 | 216.78 | 1.43 | 15.48 2017-01-07 | 13 | 16 | 198.32 | 1.23 | 15.26 2017-01-08 | 10 | 12 | 124.67 | 1.20 | 12.47 2017-01-09 | 10 | 16 | 140.88 | 1.60 | 14.09 2017-01-10 | 17 | 19 | 136.98 | 1.12 | 8.06 2017-01-11 | 12 | 14 | 99.67 | 1.17 | 8.31 2017-01-12 | 11 | 15 | 163.52 | 1.36 | 14.87 2017-01-13 | 10 | 18 | 207.43 | 1.80 | 20.74 2017-01-14 | 14 | 20 | 199.68 | 1.43 | 14.26 2017-01-15 | 16 | 22 | 207.56 | 1.38 | 12.97 2017-01-16 | 14 | 19 | 176.76 | 1.36 | 12.63 2017-01-17 | 13 | 18 | 184.48 | 1.38 | 14.19 2017-01-18 | 14 | 25 | 265.98 | 1.79 | 19.00 2017-01-19 | 10 | 17 | 178.42 | 1.70 | 17.84 2017-01-20 | 19 | 24 | 139.67 | 1.26 | 7.35 2017-01-21 | 15 | 21 | 187.66 | 1.40 | 12.51 2017-01-22 | 19 | 24 | 226.98 | 1.26 | 11.95 2017-01-23 | 17 | 24 | 212.64 | 1.41 | 12.51 2017-01-24 | 16 | 21 | 187.43 | 1.31 | 11.71 2017-01-25 | 19 | 27 | 244.67 | 1.42 | 12.88 2017-01-26 | 20 | 29 | 267.44 | 1.45 | 13.37 2017-01-27 | 17 | 25 | 196.43 | 1.47 | 11.55 2017-01-28 | 21 | 28 | 234.87 | 1.33 | 11.18 2017-01-29 | 18 | 29 | 214.66 | 1.61 | 11.93 2017-01-30 | 14 | 20 | 199.68 | 1.43 | 14.26 2017-02-01 | 19 | 27 | 189.98 | 1.42 | 10.00 2017-02-02 | 22 | 31 | 274.98 | 1.41 | 12.50 2017-02-03 | 20 | 28 | 213.76 | 1.40 | 10.69 2017-02-04 | 21 | 30 | 242.78 | 1.43 | 11.56 2017-02-05 | 22 | 34 | 267.88 | 1.55 | 12.18 2017-02-06 | 19 | 24 | 209.56 | 1.26 | 11.03 2017-02-07 | 21 | 33 | 263.76 | 1.57 | 12.56 IT'S ALL ABOUT THE LAMBDA As mentioned above, the weight for EWMA decreases exponentially for each time period in the past. The further in the past, the less weight is given. To apply the weights for our data, we'll need a smoothing parameter (also called lambda ) which will act as a multiplier on the data values. This smoothing parameter will be a value between 0 and 1 and is typically 2 divided by the sum of the length of days. 
Since we'll stick with a 7-day range, our lambda would be 2 / (1 + 7) which comes out to 0.25. The formula for calculating an EWMA boils down to this:
(Current period data value * lambda) + (Previous period EWMA * (1 - lambda)) = Current period EWMA
An alternative formula which produces the same result is:
((Current period data value - Previous period EWMA) * lambda) + Previous period EWMA = Current period EWMA
Now that we know what our lambda is and we have the formula we're going to apply, it's time to run our query:
WITH recursive exponentially_weighted_moving_average (date, average_order_value, ewma, rn) AS (
  -- Initiate the ewma using the 7-day simple moving average (sma)
  SELECT rows.date, rows.average_order_value, sma.sma AS ewma, rows.rn
  FROM (
    SELECT date, average_order_value, ROW_NUMBER() OVER(ORDER BY date) rn
    FROM daily_orders_summary
  ) rows
  JOIN (
    SELECT date, ROUND(AVG(average_order_value) OVER(ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW), 2) AS sma
    FROM daily_orders_summary
  ) sma ON sma.date = rows.date
  WHERE rows.rn = 7 -- start on the 7th day since we're using the 7-day sma

  UNION ALL

  -- Perform the ewma calculation for all the following rows
  SELECT rows.date, rows.average_order_value
       , ROUND((rows.average_order_value * 0.25) + (ewma.ewma * (1 - 0.25)), 2) AS ewma
       --, ROUND((((rows.average_order_value - ewma.ewma) * 0.25) + ewma.ewma), 2) AS ewma -- alternative formula
       , rows.rn
  FROM exponentially_weighted_moving_average ewma
  JOIN (
    SELECT date, average_order_value, ROW_NUMBER() OVER(ORDER BY date) rn
    FROM daily_orders_summary
  ) rows ON ewma.rn + 1 = rows.rn
  WHERE rows.rn <= (SELECT COUNT(*) FROM daily_orders_summary) -- upper bound on the recursion
)
-- Pull the report fields out of the CTE
SELECT date, average_order_value, ewma
FROM exponentially_weighted_moving_average;
That's a lot to take in all at once so let's break it down and learn how it works. USING A RECURSIVE CTE First, we're creating a recursive CTE ( common table expression using WITH ) called ""exponentially_weighted_moving_average"" that returns 4 field values: date, average order value, the ewma, and a row number. We're using this approach because the EWMA calculation requires the previous period's EWMA. A recursive CTE can provide the previous EWMA calculation to us for each period. Note that using a recursive CTE on a large data set is not going to be your best option. Performance will take a big dive since the query will recurse through all the data. If you have a large data set that you need to calculate EWMA for, then you should consider using the procedural language options for PostgreSQL such as PL/Python or PL/Perl. You can learn more about recursive CTEs and procedural language options in the official PostgreSQL documentation. INITIALIZING THE EWMA The first query in our WITH block is the EWMA initialization. Because the calculation requires the previous period EWMA, we have to give it something to start with. A common approach is to use the simple moving average for the length of the time period as the initial EWMA. That's what we've done here. Because we're calculating a 7-day EWMA (our lambda is based on a 7-day range), we have a sub-query called ""sma"" where we're calculating the 7-day simple moving average for the 7th day (we need 7 days to get the SMA) and using that as our EWMA starting point. If you're not familiar with the simple moving average or you just need a refresher, check out our article on basic moving averages . You could also simply initialize the EWMA with the actual data value for that date. Of course the results will be different.
Play around with what seems to work best for your data on how to initialize the EWMA since that will impact all the rest of your calculations. Most of the data in that first query comes from another sub-query called ""rows"" that employs the ROW_NUMBER() window function. We are using this sub-query to generate row numbers for each of the rows, which allows us to identify the 7th row for initialization. If you're not familiar with window functions, take a gander at our article on window functions.

RECURSING THROUGH THE DATA TO CALCULATE EWMA

The UNION ALL and the next query in the WITH block are where the recursion and the calculation of EWMA occur. We've got the same ""rows"" sub-query, but in this case, we only care about the rows following the 7th row since that's where we'll apply our EWMA calculation. We're joining the ""rows"" sub-query to our recursive CTE ""exponentially_weighted_moving_average"" (aliased as ""ewma"") on row number, where the ""ewma"" row is 1 less than the ""rows"" row. In this way we can use the previously-calculated EWMA from the ""ewma"" CTE and the current data value (average_order_value in this case) from the ""rows"" sub-query. To get the calculated EWMA for the current row, we're applying the formula in SQL as:

ROUND((rows.average_order_value * 0.25) + (ewma.ewma * (1 - 0.25)), 2) AS ewma

What we're doing here is...

* multiplying the current period data value (rows.average_order_value) by the 7-day range lambda we previously determined (0.25)
* multiplying the previous period EWMA (ewma.ewma) by 1 minus our lambda (1 - 0.25)
* adding those two values together to get the current period EWMA
* rounding to 2 decimal places (which we learned about in the Making Data Pretty article)

Note that we've also included the alternative formula in the SQL, just commented out. You can use either one.

RETURNING RESULTS

Finally we're selecting the fields we're interested in for our report from the recursive CTE. Here's what those results look like:

date       | average_order_value | ewma
----------------------------------------
2017-01-07 | 15.26 | 13.62
2017-01-08 | 12.47 | 13.33
2017-01-09 | 14.09 | 13.52
2017-01-10 | 8.06 | 12.16
2017-01-11 | 8.31 | 11.20
2017-01-12 | 14.87 | 12.12
2017-01-13 | 20.74 | 14.28
2017-01-14 | 14.26 | 14.28
2017-01-15 | 12.97 | 13.95
2017-01-16 | 12.63 | 13.62
2017-01-17 | 14.19 | 13.76
2017-01-18 | 19.00 | 15.07
2017-01-19 | 17.84 | 15.76
2017-01-20 | 7.35 | 13.66
2017-01-21 | 12.51 | 13.37
2017-01-22 | 11.95 | 13.02
2017-01-23 | 12.51 | 12.89
2017-01-24 | 11.71 | 12.60
2017-01-25 | 12.88 | 12.67
2017-01-26 | 13.37 | 12.85
2017-01-27 | 11.55 | 12.53
2017-01-28 | 11.18 | 12.19
2017-01-29 | 11.93 | 12.13
2017-01-30 | 14.26 | 12.66
2017-02-01 | 10.00 | 12.00
2017-02-02 | 12.50 | 12.13
2017-02-03 | 10.69 | 11.77
2017-02-04 | 11.56 | 11.72
2017-02-05 | 12.18 | 11.84
2017-02-06 | 11.03 | 11.64
2017-02-07 | 12.56 | 11.87

Let's look at just one date to review how the EWMA was calculated. We'll use January 21. By applying the formula for EWMA, we get:

(Current period data value * lambda) + (Previous period EWMA * (1 - lambda)) = Current period EWMA

(12.51 * 0.25) + (13.66 * (1 - 0.25)) = 13.3725

which rounds to the 13.37 shown above for January 21.

SEEING TRENDS

Now that we have our EWMA, let's plot our average order value, the simple moving average, the weighted moving average, and the EWMA together to see how the trend lines compare. We can see that the average order value by itself is pretty volatile, making it difficult to see the overall trend.
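The original post shows a comparison chart at this point. The plotting code isn't part of the article, but a rough equivalent in Python (our sketch, assuming you've exported the query results to a hypothetical ewma_results.csv with the columns shown above) might look like this; the weighted moving average from the previous article could be added as a fourth line in the same way:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical export of the query results; column names follow the result set above.
df = pd.read_csv('ewma_results.csv', parse_dates=['date'])

# Recompute a 7-day simple moving average locally for comparison.
df['sma_7'] = df['average_order_value'].rolling(7).mean()

plt.plot(df['date'], df['average_order_value'], label='average order value')
plt.plot(df['date'], df['sma_7'], label='7-day simple moving average')
plt.plot(df['date'], df['ewma'], label='7-day EWMA')
plt.legend()
plt.show()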
The simple moving average helps smooth things out, but over- or under-corrects in some places. The weighted moving average smooths the trend out further and makes it easier to see the rise that happened until about the 3rd week of January and then the slight decline from then on. The exponentially weighted moving average follows the true data values better than the other two metrics while still smoothing the trend line.

WRAPPING UP

In this article we learned how to calculate an exponentially weighted moving average using a recursive CTE. We discussed when it's useful to apply and compared the results to the other average types we looked at in previous articles. With each metric we are better able to zero in on just how our business is performing. In our next article, we'll be taking a look at CROSSTAB again and covering some aspects that we didn't have a chance to get to in our previous article on pivoting in Postgres.

Image by: msandersmusic
Lisa Smith - keepin' it simple.","In this article, we'll walk through how and why to calculate an exponentially weighted moving average.",Metrics Maven: Calculating an Exponentially Weighted Moving Average in PostgreSQL,Live,155 419,"DATALAYER EXPOSED: JONAS HELFER & JOINS ACROSS DATABASES WITH GRAPHQL

Published Jun 26, 2017 datalayer graphql join

Want something to make Monday mornings a bit more exciting? For the next few weeks, we're bringing you a new video from this year's DataLayer Conference. Up this week is Jonas Helfer from Meteor discussing joins across databases with GraphQL.

This year, we were joined by Jonas Helfer from Meteor. More specifically, he works on Meteor's Apollo Project with last year's speaker Sashko Stubailo (you can see his talk here). Jonas' presentation is centered on joins across databases with GraphQL. It's becoming more and more common for organizations to have a backend architecture powered by multiple databases and microservices, but as the number of these databases and services grows, scaling becomes more difficult. Jonas showed how GraphQL can be leveraged to pull data from multiple databases in a unified way.
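The talk itself is built on the JavaScript GraphQL and Apollo tooling, but the idea is easy to sketch in any language. Here is a minimal, purely illustrative Python example (our sketch, using the graphene library, with two in-memory dictionaries standing in for separate databases) of a single GraphQL query resolving fields from two different backends:

import graphene

# Two stand-ins for separate databases / microservices.
USERS = {1: {'id': 1, 'name': 'Ada'}}
ORDERS = {1: [{'id': 10, 'total': 42.0}]}

class Order(graphene.ObjectType):
    id = graphene.Int()
    total = graphene.Float()

class User(graphene.ObjectType):
    id = graphene.Int()
    name = graphene.String()
    orders = graphene.List(Order)

    def resolve_orders(self, info):
        # In a real deployment this would query a second database or service.
        return [Order(**o) for o in ORDERS.get(self.id, [])]

class Query(graphene.ObjectType):
    user = graphene.Field(User, id=graphene.Int(required=True))

    def resolve_user(self, info, id):
        return User(**USERS[id])

schema = graphene.Schema(query=Query)
result = schema.execute('{ user(id: 1) { name orders { total } } }')
print(result.data)   # one response combining data from both sources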
Previous DataLayer 2017 talks:

* Charity Majors' presentation on observability
* Ross Kukulinski's presentation on the state of containers
* Antonio Chavez's presentation on why he left MongoDB

Be sure to tell us what you think using hashtag #DataLayerConf and check back next Monday for the next talk at DataLayerConf.

We're in the planning stages for DataLayer 2018 right now, so if you have an idea for a talk, start fleshing it out. We'll have a CFP, followed by a blind submission review, and then select our speakers, who we'll fly to DataLayer to present. Sounds fun, right?

Thom Crowe is a marketing and community guy at Compose, who enjoys long walks on the beach, reading, spending time with his wife and daughter, and tinkering.",Jonas Helfer from Meteor discussing joins across databases with GraphQL.,DataLayer Exposed: Jonas Helfer & Joins Across Databases with GraphQL,Live,156 420,"DATA SCIENCE OF VARIABLE SELECTION: A REVIEW

Tags: Algorithms, Big Data, Feature Selection, Statistics

There are as many approaches to selecting features as there are statisticians since every statistician and their sibling has a POV or a paper on the subject. This is an overview of some of these approaches.

By Thomas Ball, Advanced Analytics Professional.

Data scientists are always stressing over the ""best"" approach to variable selection, particularly when faced with massive amounts of information -- a frequent occurrence these days.
""Massive"" by today's standards means terabytes of data and tens, if not hundreds, of millions of features or predictors. There are many reasons for this “stress” but the reality is that a single, canonical solution does not exist. There are as many approaches to selecting features as there are statisticians since every statistician and their sibling has a POV or a paper on the subject. Why Implement Machine Learning Algorithms From Scratch? For years, there have been rumors that Google uses all available features in building its predictive algorithms. To date however, no disclaimers, explanations or working papers have emerged that clarify and/or dispute this rumor. Not even their published patents help in the understanding. As a result, no one external to Google knows what they are doing, to the best of my knowledge. One of the biggest problems in predictive modeling is the conflation between classic hypothesis testing with careful model specification vis-a-vis pure data mining. The classically trained can get quite dogmatic about the need for ""rigor"" in model design and development. The fact is that when confronted with massive numbers of candidate predictors and multiple possible targets or dependent variables, the classic framework neither works, holds nor provides useful guidance – how does anyone develop a finite set of hypotheses with millions of predictors? Numerous recent papers delineate this dilemma from Chattopadhyay and Lipson's brilliant paper Data Smashing: Uncovering Lurking Order in Data ( available here ) who state, ""The key bottleneck is that most data comparison algorithms today rely on a human expert to specify what ‘features’ of the data are relevant for comparison. Here, we propose a new principle for estimating the similarity between the sources of arbitrary data streams, using neither domain knowledge nor learning."" To last year's AER paper on Prediction Policy Problems by Kleinberg, et al., (available here ) which makes the case for data mining and prediction as useful tools in economic policy making, citing instances where ""causal inference is not central, or even necessary."" The fact is that the bigger, $64,000 question is the broad shift in thinking and challenges to the classic hypothesis-testing framework implicit in, e.g., this Edge.org symposium on ""obsolete"" scientific thinking (available here ) as well as this recent article by Eric Beinhocker on the ""new economics"" (available here ) which presents some radical proposals for integrating widely disparate disciplines such as behavioral economics, complexity theory, network and portfolio theory into a platform for policy implementation and adoption. Needless to say, these discussions go far beyond merely statistical concerns and suggest that we are undergoing a fundamental shift in scientific paradigms. The shifting views are as fundamental as the distinctions between reductionistic, Occam's Razor like model-building vs Epicurus' expansive Principle of Plenitude or multiple explanations which roughly states that if several findings explain something, retain them all (see, e.g., here ). Of course, guys like Beinhocker are totally unencumbered with practical, in the trenches issues regarding applied, statistical solutions to this evolving paradigm. Wrt the nitty-gritty questions of ultra-high dimensional variable selection, there are many viable approaches to model building that leverage, e.g., Lasso, LAR, stepwise algorithms or ""elephant models” that use all of the available information. 
The reality is that, even with AWS or a supercomputer, you can't use all of the available information at the same time – there simply isn’t enough RAM to load it all in. What does this mean? Workarounds have been proposed, e.g., the NSF's Discovery in Complex or Massive Datasets: Common Statistical Themes to ""divide and conquer"" or ""bags of little jacknife"" algorithms for massive data mining, e.g., Wang, et al's paper, A Survey of Statistical Methods and Computing for Big Data (available here ) as well as Leskovec, et al's book Mining of Massive Datasets (available here ). There are now literally hundreds, if not thousands of papers that deal with various aspects of these challenges, all proposing widely differing analytic engines as their core from so-called “D Bayesian tensor models to classic, supervised logistic regression, and more. Fifteen years or so years ago, the debate largely focused on questions concerning the relative merits of hierarchical Bayesian solutions vs frequentist finite mixture models. In a paper addressing these issues, Ainslie, et al. (available here ) came to the conclusion that, in practice, the differing theoretical approaches produced largely equivalent results with the exception of problems involving sparse and/or high dimensional data -- where HB models had the advantage. Today with the advent of D&C-type workarounds, any arbitrage HB models may have historically enjoyed are rapidly being eliminated. The basic logic of these D&C-type workarounds are, by and large, extensions of Breiman's famous random forest technique which relied on bootstrapped resampling of observations and features. Breiman did his work in the late 90s on a single CPU when massive data meant a few dozen gigs and a couple of thousand features processed over a couple of thousand iterations. On today's massively parallel, multi-core platforms, it is possible to run algorithms analyzing terabytes of data containing tens of millions of features that build millions of ""RF"" mini-models in a few hours. Theoretically, it’s possible to build models using petabyes of data with these workarounds but the present IT platforms and systems won’t execute that yet – to the best of my knowledge (if any knows where this is being done and how, please feel free to share that information). There are any number of important questions coming out of all of this. One has to do with a concern over a possible loss of precision due to the approximating nature of these workarounds. This issue has been addressed by Chen and Xie in their paper, A Split-and-Conquer Approach for Analysis of Extraordinarily Large Data (available here ) where they conclude that these approximations are indistinguishably different from ""full information"" models. A second concern which, to the best of my knowledge hasn't been adequately addressed by the literature, has to do with what is done with the results (i.e., the ""parameters"") from potentially millions of predictive mini-models once the workarounds have been rolled up and summarized. In other words, how does one execute something as simple as ""scoring"" new data with these results? Are the mini-model coefficients to be saved and stored or does one simply rerun the D&C algorithm(s) on new data? In his book, Numbers Rule Your World (available here ), Kaiser Fung describes the dilemma Netflix faced when presented with an ensemble of only 104 models handed over by the winners of their competition. 
The winners had, indeed, minimized the MSE vs all other competitors but this translated into only a several decimal place improvement in accuracy on the 5-point, Likert-type rating scale used by their movie recommender system. In addition, the IT maintenance required for this small ensemble of models cost much more than any savings seen from the ""improvement"" in model accuracy. Then there's the whole question of whether ""optimization"" is even possible with information of this magnitude. For instance, Emmanuel Derman, the physicist and financial engineer, in his autobiography My Life as a Quant suggests that optimization is an unsustainable myth, at least in financial engineering. Finally, questions concerning relative feature importance with massive numbers of features have yet to be addressed. There are no easy answers wrt questions concerning the need for variable selection and the new challenges opened up by the current, Epicurean workarounds remain to be resolved. The bottom line is that we are all data scientists now. Bio: Thomas Ball is an advanced analytics leader with Fortune 500 and start-up experience. He has led teams in management consulting, digital media, financial and health care industries. Source : Originally posted anonymously by the author to a thread on Stack Exchange's statistical Q&A site, Cross Validated . Reposted with permission. Related: * Datasets Over Algorithms * Why Implement Machine Learning Algorithms From Scratch? * Beyond One-Hot: an exploration of categorical variables -------------------------------------------------------------------------------- Previous post Next post -------------------------------------------------------------------------------- MOST POPULAR LAST 30 DAYS Most viewed 1. 7 Steps to Mastering Machine Learning With Python R vs Python for Data Science: The Winner is ... What is the Difference Between Deep Learning and “Regular” Machine Learning? TensorFlow Disappoints - Google Deep Learning falls shallow 9 Must-Have Skills You Need to Become a Data Scientist Top 10 Data Analysis Tools for Business How to Explain Machine Learning to a Software Engineer Most shared 1. What is the Difference Between Deep Learning and “Regular” Machine Learning? Data Science of Variable Selection: A Review R, Python Duel As Top Analytics, Data Science software – KDnuggets 2016 Software Poll Results A Visual Explanation of the Back Propagation Algorithm for Neural Networks Machine Learning Key Terms, Explained How to Build Your Own Deep Learning Box Big Data Business Model Maturity Index and the Internet of Things (IoT) MORE RECENT STORIES * Predicting purchases at retail stores using HPE Vertica and Da... Top tweets, Jun 15-21: Predicting UEFA Euro2016; Visual Exp... Strata + Hadoop World, New York City, Sep 26-29 – KDnugg... Cisco 2016 Data and Analytics Conference, Sep 19-21, Chicago Machine Learning Trends and the Future of Artificial Intelligence Microsoft: Senior Software Engineer. Mining Twitter Data with Python Part 3: Term Frequencies History of Data Mining Bank of Ireland: Senior Data Scientist within the Advanced Ana... DuPont Pioneer: Data Scientist – Encirca KDnuggets 16:n22, Jun 22: Data Science Blog Contest; Free M... KDnuggets Blog Contest: Automated Data Science and Machine Lea... Data Science Career Days at Metis, NYC – June 23, SF ... 
A Review of Popular Deep Learning Models HPE Haven OnDemand Text Extraction API Cheat Sheet for Developers Standards-based Deployment of Predictive Analytics Data Science for Internet of Things course, Online or London How to Compare Apples and Oranges, Part 2 – Categorical ... Top Stories, June 13-19: A Visual Explanation of the Back Prop... Chief Data Officer Forum Insurance 2016, Sep 15, Chicago KDnuggets Home » News » 2016 » Jun » Tutorials, Overviews » Data Science of Variable Selection: A Review ( 16:n20 ) © 2016 KDnuggets. About KDnuggets Subscribe to KDnuggets News | Follow @kdnuggets | | X",There are as many approaches to selecting features as there are statisticians since every statistician and their sibling has a POV or a paper on the subject. This is an overview of some of these approaches.,Data Science of Variable Selection,Live,157 424,"RStudio Blog * Home * Subscribe to feed D3HEATMAP: INTERACTIVE HEAT MAPS June 24, 2015 in Packages | Tags: d3 , htmlwidgets We’re pleased to announce d3heatmap , our new package for generating interactive heat maps using d3.js and htmlwidgets . Tal Galili , author of dendextend , collaborated with us on this package. d3heatmap is designed to have a familiar feature set and API for anyone who has used heatmap or heatmap.2 to create static heatmaps. You can specify dendrogram, clustering, and scaling options in the same way. d3heatmap includes the following features: * Shows the row/column/value under the mouse cursor * Click row/column labels to highlight * Drag a rectangle over the image to zoom in * Works from the R console, in RStudio, with R Markdown , and with Shiny INSTALLATION install.packages(""d3heatmap"") EXAMPLES Here’s a very simple example (source: flowingdata ): library(d3heatmap) url <- ""http://datasets.flowingdata.com/ppg2008.csv"" nba_players <- read.csv(url, row.names = 1) d3heatmap(nba_players, scale = ""column"") You can easily customize the colors using the colors parameter. This can take an RColorBrewer palette name, a vector of colors, or a function that takes (potentially scaled) data points as input and returns colors. Let’s modify the previous example by using the ""Blues"" colorbrewer palette, and dropping the clustering and dendrograms: d3heatmap(nba_players, scale = ""column"", dendrogram = ""none"", color = ""Blues"") If you want to use discrete colors instead of continuous, you can use the col_* functions from the scales package. d3heatmap(nba_players, scale = ""column"", dendrogram = ""none"", color = scales::col_quantile(""Blues"", NULL, 5)) Thanks to integration with the dendextend package, you can customize dendrograms with cluster colors: d3heatmap(nba_players, colors = ""Blues"", scale = ""col"", dendrogram = ""row"", k_row = 3) For issue reports or feature requests, please see our GitHub repo . 
SHARE THIS: * Reddit * More * * Email * Facebook * * Print * Twitter * * LIKE THIS: Like Loading...RELATED SEARCH LINKS * Contact Us * Development @ Github * RStudio Support * RStudio Website * R-bloggers CATEGORIES * Featured * News * Packages * R Markdown * RStudio IDE * Shiny * shinyapps.io * Training * Uncategorized ARCHIVES * May 2016 * April 2016 * March 2016 * February 2016 * January 2016 * December 2015 * October 2015 * September 2015 * August 2015 * July 2015 * June 2015 * May 2015 * April 2015 * March 2015 * February 2015 * January 2015 * December 2014 * November 2014 * October 2014 * September 2014 * August 2014 * July 2014 * June 2014 * May 2014 * April 2014 * March 2014 * February 2014 * January 2014 * December 2013 * November 2013 * October 2013 * September 2013 * June 2013 * April 2013 * February 2013 * January 2013 * December 2012 * November 2012 * October 2012 * September 2012 * August 2012 * June 2012 * May 2012 * January 2012 * October 2011 * June 2011 * April 2011 * February 2011 EMAIL SUBSCRIPTION Enter your email address to subscribe to this blog and receive notifications of new posts by email. Join 19,578 other followers RStudio is an affiliated project of the Foundation for Open Access Statistics 16 COMMENTS June 24, 2015 at 10:25 pm SF99 Trying out pkg: d3heatmap_0.6.0 in Rstudio. Like it – but documentation with simple, clear examples is sparse …without clear documentation = difficult to use!. In the included doc example: x <- mtcars # [c(2:4,7),1:4] d3heatmap(x, k_row = 4, k_col = 2) what are the function args: k_row, k_col, scale and some of the other function args? Even in this Rstudio post, the example at the beginning does not work at all: QUOTE: ———————————– Here’s a very simple example (source: flowingdata): url <-"" http://datasets.flowingdata.com/ppg2008.csv" ; nba_players END QUOTE —————————- yields: ""error: nba-players not defined"" (?) IN SUMMARY: Urgently needed – documentation with clear, stepXstep examples. Please? Without it, this potentially fine pkg is an exercise in frustration… Thank you! * June 25, 2015 at 5:11 am Joe Cheng Thank you for the feedback, the first code sample that defined nba_players was truncated due to my poor WordPress skills. I’ve fixed it, so you should be able to step through each of the examples now. The k_row, k_col, scale, and all other parameters are currently documented only in the R help, i.e. ?d3heatmap::d3heatmap. Other than k_row/k_col, all the stats-related parameters are identical to heatmap and heatmap.2, if you’re familiar with those functions. Hopefully we can find the time after the useR conference next week to add more documentation. In the meantime, if you have any specific questions feel free to leave additional comments or email me at joe@rstudio.com . Thanks again! * June 25, 2015 at 10:47 am SF99 Thank you for the quickly reply, Joe! No, I was not familiar with the with the heatmap and heatmap.2 pkgs. But since d3heatmap is an evolution above the latter 2 pkgs, I’d like to suggest that it should include a _”self-contained”_ help file with clear, stepXstep examples – (so the user does not need to refer to other pkgs, in order to use d3heatmap). Again Joe – looking forward to be a frequent user of your _excellent_ d3heatmap pkg . Thanks! SF99 * June 24, 2015 at 10:45 pm Alberto Jaimes Romero Hi there, there is some missed code. I had to read the dowloaded file, easy; and transform it into a matrix. 
Greetings * June 25, 2015 at 5:13 am Joe Cheng Sorry about that, the code sample in the blog post was indeed truncated. I’ve fixed it now. June 25, 2015 at 4:48 am GD It is very beautiful indeed! How can I center the plot in an rmarkdown document? * June 25, 2015 at 5:19 am Joe Cheng There’s not an official way to center htmlwidgets in rmd documents right now, I don’t think. But in a pinch either of these two approaches will work: 1) Add width=”100%” as a parameter to the d3heatmap. That counts as centered, right?😉 2) Wrap it with a div: tags$div(style=” margin-right: auto”, d3heatmap::d3heatmap(mtcars, width=500) ) The width:500px and width=500 can be any number, but they have to match. * June 25, 2015 at 5:21 am Joe Cheng I forgot to mention in my previous solution #2, you also need to call library(htmltools). Another way to go is to include this in your first code chunk: “`{r echo=FALSE} library(htmltools) tags$style(“ }”) “` This will cause any d3heatmap in the document to be centered. June 25, 2015 at 5:21 am Joe Cheng Ugh, the “` should be three backticks. * June 25, 2015 at 8:50 am GD The row/column/value under the mouse cursor does not appear when I see an rmarkdown html in firefox (updated to latest version). * June 25, 2015 at 12:29 pm Joe Cheng I can’t reproduce this–do you have an example Rmd you can email me? (joe@rstudio.com) * June 25, 2015 at 3:38 pm ΓΔ 047 Try this on firefox http://www.htmlwidgets.org/showcase_d3heatmap.html * June 25, 2015 at 11:39 am IP Great tutorial – question about the tooltip though. Since this takes a matrix, the ‘on hover’ box shows ‘row’, ‘column’ and ‘value’. Is there a way to specify names for these? * June 25, 2015 at 12:37 pm Joe Cheng It’s not possible at the moment. Do you mind filing an issue here? https://github.com/rstudio/d3heatmap/issues/new June 28, 2015 at 1:54 pm dendextend version 1.0.1 + useR!2015 presentation | R-statistics blog […] between Joe and I). You are invited to see lively examples of the package in the post at the RStudio blog. Here is just one quick […] July 1, 2015 at 12:44 pm Guillaume Devailly Nice. It is quite slow for big heatmap (i.e. 600 x 600), and sometimes even fails… Plot.ly appears faster (at least on Firefox and Chrome): https://gdevailly.shinyapps.io/Heatmap (with plot.ly) http://moderndata.plot.ly/dashboards-in-r-with-shiny-plotly/ (tutorial) https://gdevailly.shinyapps.io/d3heatmap (with d3heatmap, dirty test, quite slow to display and sometimes fails) It’s a shame as d3heatmap function are much more R user friendly and plot.ly do not do trees. « RStudio adds custom domains, bigger data and package support to shinyapps.io DT: An R interface to the DataTables library »Blog at WordPress.com. The Tarski Theme . Subscribe to feed. FollowFOLLOW “RSTUDIO BLOG” Get every new post delivered to your Inbox. Join 19,578 other followers Build a website with WordPress.com Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email. %d bloggers like this:","We’re pleased to announce d3heatmap, our new package for generating interactive heat maps using d3.js and htmlwidgets. Tal Galili, author of dendextend, collaborated with us on this package. …",d3heatmap: Interactive heat maps,Live,158 426,"G. 
Adam Cox Blocked Unblock Follow Following May 16 -------------------------------------------------------------------------------- CITIZEN SCIENTIST FINDS “DEATH STAR” IN SETI DATA SET OUTER SPACE RADIOWAVE METADATA, #ALIENLIFE, #CLICKBAIT In preparation for the SETI Institute’s Hackathon and Code Challenge , a citizen scientist, Dr. Arun Ramamoorthy , who is also a researcher from Arizona State University, was looking at data from the SETI@IBMCloud project. (The SETI@IBMCloud project and Hackathon/Code Challenge are separate projects, but related.) In particular, he was looking at the metadata for the raw SETI data, found in a table called SignalDB . To get started, he wanted to simply visualize this data as a sphere. Amongst other things, the SignalDB database contains the Right Ascension (RA) and Declination (DEC) coordinates for all “Candidate” events observed by the SETI Institute from 2013 to 2015. The RA and DEC values specify the location of objects in the sky. By mapping the (RA,DEC) coordinates to a sphere and then rotating through a small angle multiple times, Dr. Ramamoorthy was able to create this “Death Star” GIF: Dr. Arun Ramamoorthy ’s “Death Star” of radio signal observations from the SETI@IBMCloud data set.The block of data points (where the Death Star’s “superlaser” shoots out of) maps to the location of a number of star systems in the “Kepler field.” This is a patch of sky observed by NASA’s Kepler spacecraft where thousands of exoplanets were discovered before the spacecraft malfunctioned in 2012. Since the SETI Institute tends to observes stars with known exoplanets, this field shows up predominantly because of the large number of observations made in this area. The Kepler Field shows us that, on average, 1.6 exoplanets orbit each star in our galaxy. This means there are roughly 160 billion planets in our galaxy, 40 billion of which may be rocky planets within the habitable zone. What’s the likelihood of any one of these planets hosting intelligent life? If you’re interested in joining the SETI Institute on its mission, or looking at the data yourself, register for the upcoming SETI Institute hackathon and code challenge . If you enjoyed this article, or are just plain enthusiastic about the Kepler field, please ♡ it to recommend it to other Medium readers. Thanks to Mike Broberg . * Astronomy * Data Science * Data Visualization * SETI Blocked Unblock Follow FollowingG. ADAM COX FollowIBM WATSON DATA LAB The things you can make with data, on the IBM Watson Data Platform. * Share * * * * Never miss a story from IBM Watson Data Lab , when you sign up for Medium. Learn more Never miss a story from IBM Watson Data Lab Get updates Get updates","In preparation for the SETI Institute’s Hackathon and Code Challenge, a citizen scientist, Dr. Arun Ramamoorthy, who is also a researcher from Arizona State University, was looking at data from the…",Citizen Scientist finds “Death Star” in SETI data set,Live,159 430,"LOCATION TRACKER – PART 2 markwatson / August 11, 2016Want to scale your apps to millions of users while maintaining the ability to safely and securely (and easily!) sync private information to Cloudant? In this tutorial we’ll show you how. In Location Tracker – Part 1 we showed you how to create an iOS app that tracks your location, syncs with Cloudant, and performs geo queries to find nearby points of interest. 
We showed you how to use the database-per-user design pattern to take advantage of Cloudant’s powerful sync capabilities while ensuring a user’s location information remains private. We also discussed how the database-per-user design pattern works well for small- to medium-sized apps, but not so much when you want to scale to millions of users. In this tutorial we’ll show you how we extended Location Tracker to do just that. A REFRESHER The Location Tracker app is an iOS app developed in Swift that tracks user locations and syncs those locations to Cloudant. As a user moves, and new locations are recorded, the app queries the server for points of interest near the user’s location. Below is a screenshot of the Location Tracker app. Blue pins mark each location recorded by the app. A blue line is drawn over the path the user has travelled. Each time the Location Tracker app records a new location, a radius-based geo query is performed in Cloudant to find nearby points of interest (referred to in the app as “places”). The radius is represented by a green circle. Places are displayed as green pins. In Part 1 we identified five key requirements for the Location Tracker app: 1. Track location in the foreground and background. 2. Use geospatial queries to find points of interest within a specified radius. 3. Run offline. 4. Keep user location information private. 5. Provide ability to consolidate and analyze all locations. To satisfy requirements #4 and #5, we implemented the database-per-user design pattern. It was a great first step for learning how to use Cloudant Sync for syncing personal or private data, but this design becomes problematic when scaling to millions or even tens of thousands of users. In this post Glynn Bird points out a few of the issues with scaling using the database-per-user pattern: * Backup – How do you design a backup-and-restore plan for millions or even thousands of databases? * Reporting – How do you generate reports across millions of databases? * Change control – How do you propagate data updates across millions of databases? To help provide a solution to these issues (and more), a team of IBMers built Cloudant Envoy . CLOUDANT ENVOY FTW! Cloudant Envoy is a microservice that acts as a replication target for your PouchDB web app or Cloudant Sync-based native app. Envoy allows your client-side code to adopt a “one database per user” design pattern, with a copy of a user’s data stored on the mobile device and synced to the cloud when online, while invisibly storing all the users’ data in one large database. This prevents the proliferation of databases that occurs as users are added and facilitates simpler backup and server-side reporting.This is how Cloudant Envoy is described on GitHub. Let’s break down this description and unpack the relevant points for Location Tracker: Cloudant Envoy is a microservice that acts as a replication target for your PouchDB web app or Cloudant Sync-based native app. In Part 1 we showed how the Location Tracker iOS app targeted user-specific databases in Cloudant for replication. In this tutorial we’ll show how (without it even knowing it) the iOS app will target Cloudant Envoy. Envoy allows your client-side code to adopt a “one database per user” design pattern, with a copy of a user’s data stored on the mobile device and synced to the cloud when online… From the beginning, the Location Tracker iOS app was built using the database-per-user design pattern. 
Each user’s locations are stored locally on the iOS device and synced to Cloudant when online. This doesn’t change when replicating to Envoy. In fact zero changes were required to the iOS app to support Envoy. …while invisibly storing all the users’ data in one large database. This prevents the proliferation of databases that occurs as users are added and facilitates simpler backup and server-side reporting. Using Cloudant Envoy we can store all private location data in a single database. This makes it easier for backend developers or data scientists to work with the data and addresses the three problems we mentioned with the database-per-user pattern: backup, reporting, and change control. ARCHITECTURE In Part 1 we implemented the database-per-user design pattern and created a database for each user to track that user’s location. This is what our architecture diagram looked like: Location tracker server v1: Users hit Node.js server, location syncs directly to Cloudant, many unique small DBs in Cloudant. User registration and geo queries were performed through a Node.js application running on IBM Bluemix, while locations were synced directly to user-specific databases in Cloudant. User-specific databases were configured to replicate to a centralized database to store all locations. With Cloudant Envoy our architecture is greatly simplified: Location tracker server v2: Users hit improved Node.js server, location syncs to Cloudant Envoy proxy, a single big DB in Cloudant. Here in Part 2, user registration and geo queries are still performed through a custom Node.js app, but now all location replication is routed through Envoy and stored in a single, centralized database. We are no longer connecting directly to Cloudant. We no longer have to create databases for every user, or configure replication from those databases to our centralized location database, and we continue to satisfy our requirements, including: * Keep user location information private – This is handled completely by Envoy. Users can only access their own locations. * Provide ability to consolidate and analyze all locations – By default, with Envoy, all locations are stored in the same database. No need for replication or data duplication. THE NEW SERVER In Part 1 we discussed the Location Tracker Server, a Node.js application that provides RESTful APIs for registering new users and querying places using Cloudant Geo . For this tutorial we have created a new server to perform these functions and configure support for Cloudant Envoy. That server is called the Location Tracker Envoy Server . When you install the Location Tracker Envoy Server three databases will be created in your Cloudant instance: 1. envoyusers – This database is used by the server and by Cloudant Envoy to manage and authenticate users. 2. lt_locations_all_envoy – This database is used to keep track of all locations synced from iOS devices to Cloudant through Envoy. 3. lt_places – This database contains a list of places that the Location Tracker app will query. Follow the instructions on the Location Tracker Envoy Server GitHub page to get the Location Tracker Envoy Server up and running locally or on Bluemix. THE SAME CLIENT As mentioned previously, zero changes were required to the iOS app to support sync with Envoy. The iOS app is given the location replication target on login. Envoy is a drop-in replacement for Cloudant replication. Instead of returning the path to a user-specific database for replication, the server returns the path to the Envoy instance. 
Once you’ve set up the Location Tracker Envoy Server, follow the instructions on the Location Tracker App GitHub page to get the Location Tracker App up and running in Xcode. HOW IT WORKS In the rest of this tutorial we’ll provide more detail on how we are using Envoy. For more information on how the app tracks locations or queries for points of interest, please check out Part 1 . This tutorial focuses on Cloudant Envoy and the changes made to the backend to support Envoy. USER REGISTRATION Cloudant Envoy has a few different options for managing users. You can configure which method to use with the ENVOY_AUTH environment variable. This variable must be set on both the Cloudant Envoy app and Location Tracker Envoy Server app in Bluemix. See the Cloudant Envoy documentation for more information regarding the different authentication options available. By default users are stored in a database called envoyusers . The user registration process has been greatly simplified from Part 1. The same PUT request is sent from the iOS app: { ""username"": ""markwatson"", ""password"": ""passw0rd"", ""type"": ""user"", ""_id"": ""markwatson"" } However, the backend processing of this request is much simpler. Previously the backend would create new databases, set up API keys and passwords, and configure continuous replication between the new databases and the centralized locations database. When the new Node.js server receives the PUT request the following steps are executed: 1. Check if the user exists with the specified id. If the user already exists, then return a status of 409 to the client. 2. Store the user in the users database with their id and password (hashed). That’s it! USER LOGIN Users are logged in immediately after registering. Again, no changes were made to the iOS app. The app sends the following request to the Node.js server: { ""username"": ""markwatson"", ""password"": ""passw0rd"" } And the server replies with a response in the same format as the previous version of the server: { ""ok"": true, ""api_key"": ""markwatson"", ""api_password"": ""passw0rd"", ""location_db_name"": ""lt_locations_all_envoy"", ""location_db_host"": ""cloudant-envoy-XXXX.mybluemix.net"" } The motivation here is backwards compatibility. The app expects to receive the API key, password, database, and host to sync to. In Part 1 this was the user-specific database, but as you can see now, the server is sending the information required to sync with Envoy. The api_key and api_password fields now take the user’s username and password as their values. This is what is expected by Envoy, and by using this format the code maintains backwards compatibility with our server from Part 1. Correspondingly, the unique values from our old database-per-user pattern — location_db_name and location_db_host — now take standardized values: ""lt_locations_all_envoy"" and the Envoy host, respectively. SYNCING LOCATIONS Syncing locations between the client and the server has not changed. Envoy implements the same replication protocol as Cloudant, making the migration completely transparent to the client. The Location Tracker App uses Cloudant Sync for iOS to sync with Envoy the same exact way it would sync directly to Cloudant. THE DATA How does Envoy know who owns the data when they are all stored in the same database? There are a few different ways that Envoy can identify who owns the data, but the same principle is applied in each case: 1. When saving new locations, alter the data to include the authenticated user’s information. 2. 
When retrieving the locations, use the authenticated user’s information to filter the data. Envoy modifies each document on the way in and filters each document on the way out. Envoy provides different options for adding ownership information to the data. These options can be configured by setting the ENVOY_ACCESS environment variable in Cloudant Envoy. See the Cloudant Envoy documentation for more information. By default, Envoy stores the ownership of a document in the _id field of the document. It prepends the sha1 hash of the username to the id. Here’s an example document: { ""_id"": ""c00268ec8506774f20229f1eb9142e0d1f1a938b-014144EE-25BB-4251-94AC-A7BBD3C04CB5"", ""_rev"": ""1-8801d4d9a3bf8a539692af9697b89eb5"", ""created_at"": 1468251203669.592, ""geometry"": { ""type"": ""Point"", ""coordinates"": [ -122.39203496, 37.5668706 ] }, ""properties"": { ""timestamp"": 1468251203669.592, ""username"": ""envoy_user1"", ""background"": false }, ""type"": ""Feature"" } This document was stored for envoy_user1 . Every one of envoy_user1 ‘s documents stored by Envoy will have an _id prepended with c00268ec8506774f20229f1eb9142e0d1f1a938b . DIFF PART1 PART2 To summarize, let’s do a diff with Part 1 of the Location Tracker to see the significant changes that we’ve made to help scale our solution: * Only in Part 1 – Create new database for each user. * Only in Part 1 – Configure replication between each user-specific database and consolidated location database. * Only in Part 1 – Tell iOS app to sync directly to user-specific database in Cloudant. * Only in Part 2 – Tell iOS app to sync with Cloudant Envoy. * iOS App – No changes. Cloudant was built to scale, but creating millions of databases for millions of users is not scalable. Cloudant Envoy stores your private data in a single database that can scale to support millions of users while allowing you to reap all the benefits of Cloudant Sync. Using Cloudant Envoy we have not only improved our ability to scale, but we have simplified almost every aspect of our solution. CONCLUSION In this tutorial, we showed how to use Cloudant Envoy to scale Cloudant’s data replication & synchronization capabilities to millions of mobile users. We showed you how Cloudant Envoy provides a drop-in replacement for Cloudant replication that allows you to safely and securely sync private location information into a single, consolidated database. Cloudant Envoy is still in beta, but we’re really excited about its potential and urge you to start experimenting with it today. For more information regarding the Location Tracker and Cloudant Envoy please see the following links: * Location Tracker – Part 1 * Cloudant Envoy GitHub page * Scaling Offline First with Envoy SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Click to share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Tagged: cloudant / Cloudant Envoy / database per user / Mobile / Offline First / swift Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. 
Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Geospatial * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM Privacy IBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.",Our Location Tracker example app shows how simple it is to use Cloudant with Swift + GeoJSON. It's offline-first and scales up the database per user pattern.,Location Tracker – Part 2,Live,160 433,"Skip to main content IBM developerWorks / Developer Centers Sign In | Register Cloud Data Services * Services * How-Tos * Blog * Events * ConnectCONTENTS * Apache Spark * Get Started * Get Started in Bluemix * Tutorials * Load dashDB Data with Apache Spark * Load Cloudant Data in Apache Spark Using a Python Notebook * Load Cloudant Data in Apache Spark Using a Scala Notebook * Build SQL Queries * Use the Machine Learning Library * Build a Custom Library for Apache Spark * Sentiment Analysis of Twitter Hashtags * Use Spark Streaming * Launch a Spark job using spark-submit * Sample Notebooks * Sample Python Notebook: Precipitation Analysis * Sample Python Notebook: NY Motor Vehicle Accidents Analysis * BigInsights * Get Started * BigInsights on Cloud for Analysts * BigInsights on Cloud for Data Scientists * Perform Text Analytics on Financial Data * Sample Scripts * Compose * Get Started * Create a Deployment * Add a Database and Documents * Back Up and Restore a Deployment * Enable Two-Factor Authentication * Add Users * Enable Add-Ons for Your Deployment * Compose Enterprise * Get Started * Cloudant * Get started * Copy a sample database * Create a database * Change database permissions * Connect to Bluemix * Developing against Cloudant * Intro to the HTTP API * Execute common API commands * Set up pre-authenticated cURL * Database Replication * Use cases for replication * Create a replication job * Check replication status * Set up replication with cURL * Indexes and Queries * Use the primary index * MapReduce and the secondary index * Build and query a search index * Use Cloudant Query * Cloudant Geospatial * Integrate * Create a Data Warehouse from Cloudant Data * Store Tweets Using Cloudant, dashDB, and Node-RED * Load Cloudant Data in Apache Spark Using a Scala Notebook * Load Cloudant Data in Apache Spark Using a Python Notebook * dashDB * dashDB Quick Start * Get * Get started with dashDB on Bluemix * Load data from the desktop into dashDB * Load from Desktop Supercharged with IBM Aspera * Load data from the Cloud into dashDB * Move data to the Cloud with dashDB’s MoveToCloud script * Load Twitter data into dashDB * Load XML data into dashDB * Store Tweets Using Bluemix, Node-RED, Cloudant, and dashDB * Load JSON Data from Cloudant into dashDB * Integrate dashDB and Informatica Cloud * Load geospatial data into dashDB to analyze in Esri ArcGIS * Bring Your Oracle and Netezza Apps to dashDB with Database Conversion Workbench (DCW) * Install IBM Database Conversion Workbench * Convert data from Oracle to dashDB * Convert IBM Puredata System for Analytics to dashDB * From Netezza to dashDB: It’s That Easy! 
USE THE MACHINE LEARNING LIBRARY

Jess Mantaro / October 22, 2015

Learn how to use the Apache® Spark™ Machine Learning Library (MLlib) in IBM Analytics for Apache Spark on IBM Bluemix. Apache® Spark™ includes extension libraries that can be used for SQL and DataFrames, streaming, machine learning, and graph analysis. In this video, you'll see how to use machine learning algorithms to determine the top drop off location for New York City taxis using a popular algorithm known as KMeans. You can also read a transcript of this video.

RELATED LINKS
* Build SQL Queries
* Load and Filter Cloudant Data with Apache Spark
* Load and Analyze dashDB Data with Apache Spark

TRY THE TUTORIAL

Learn how to use Apache® Spark™ machine learning algorithms to determine the top drop off location for New York City taxis using the KMeans algorithm.

WHAT YOU'LL LEARN

At the end of this tutorial, you should be able to:
* download New York City taxi cab data in CSV format.
* create a Scala notebook in IBM Analytics for Apache Spark.
* load a CSV file into a Scala notebook.
* use the KMeans and Vectors algorithms to analyze the data.

BEFORE YOU BEGIN

Watch the Getting Started on Bluemix video to create a Bluemix account and add the IBM Analytics for Apache Spark service.

PROCEDURE 1: DOWNLOAD NEW YORK CITY TAXI CAB DATA

1. Navigate to the NYC OpenData site.
2. Click Transportation.
3. For the search criteria, type taxi.
4. Select the trip data of your choice, and download the data in CSV format. We recommend you select the 2013_Green_Taxi_Trip_data.csv file, or change the code found later in this tutorial to match the selected year.

PROCEDURE 2: CREATE A SCALA NOTEBOOK

1. Sign in to Bluemix.
2. Access the Dashboard, and open the Apache Spark instance.
3. Click New Notebook, select Scala, type a name for the notebook, and click Create.
4. Click Add Data Source in the right sidebar.
5. Drag and drop the CSV file you downloaded in procedure 1 into the box labelled Drop file to add data source.
6. Paste the following code into the first cell in the notebook, and then click the Run icon on the toolbar. This first cell contains two commands that set up use of the Apache® Spark™ machine learning algorithms KMeans and Vectors.
Commands:
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
7. Paste the following code into the second cell, and then click Run. Replace nyctaxisub.csv with the file name you used. This command reads the contents of the file and assigns it to the taxifile variable. For example, the filename could be 2013_Green_Taxi_Trip_data.csv.
Command:
val taxifile = sc.textFile(""swift://notebooks.spark/filename"")
8. Paste the following code into the third cell, and then click Run. This command shows what the data in this file looks like. When it displays, you'll see that the first row is the header for the columns, and the second row shows actual data. Of particular interest are the dropoff_latitude and dropoff_longitude columns.
Command:
taxifile.take(2)
9. Paste the following code into the fourth cell. This command filters the data so we only see the records from 2013, and also makes sure that the dropoff_latitude and dropoff_longitude aren't null. If you downloaded a different data set, the column numbers may be different.
Commands:
val taxidata=taxifile.filter(_.contains(""2013"")).
  filter(_.split("","")(4) != """").
  filter(_.split("","")(18) != """")
10. Paste the following code into the fifth cell, and then click Run. This filters the data down to drop off areas with latitudes and longitudes that are roughly in the Manhattan area.
Commands:
val taxifence = taxidata.filter(_.split("","")(4).toDouble>40.70).
  filter(_.split("","")(4).toDouble<40.86).
  filter(_.split("","")(18).toDouble>(-74.02)).
  filter(_.split("","")(18).toDouble<(-73.93))
11. Paste the following code into the sixth cell, and then click Run. This command takes the data and puts it in a vector which will be used as input for the KMeans algorithm.
Command:
val taxi=taxifence.map(line=>Vectors.dense(line.split(',').slice(17,19).map(_.toDouble)))
12. Paste the following code into the seventh cell, and then click Run. This final cell contains the commands that invoke the KMeans algorithm. In this case, we're looking for the single top drop off location; however, the parameters could be changed in this cell to determine the top three or the top ten locations. It's also interesting to note that Apache® Spark™ machine learning provides other algorithms for collaborative filtering, clustering, and classification.
Commands:
val model=KMeans.train(taxi,1,1)
val clusterCenters=model.clusterCenters.map(_.toArray)
clusterCenters.foreach(lines => println(lines(0),lines(1)))
Select and copy the coordinates. Then, open a browser, and paste the coordinates into a map program such as Google Maps to see the location on the map. Find more videos in the Spark Learning Center at http://developer.ibm.com/clouddataservices/spark.",How to use the Spark machine learning programming model in IBM Analytics for Apache Spark on IBM Bluemix,Use the Machine Learning Library in Spark,Live,161 439,"
Posted on March 27, 2017 March 27, 2017 Economics and Finance , R , Statistics and Data ScienceAN INTRODUCTION TO STOCK MARKET DATA ANALYSIS WITH R (PART 1) Around September of 2016 I wrote two articles on using Python for accessing, visualizing, and evaluating trading strategies (see part 1 and part 2 ). These have been my most popular posts, up until I published my article on learning programming languages (featuring my dad’s story as a programmer), and has been translated into both Russian (which used to be on backtest.ru at a link that now appears to no longer work) and Chinese ( here and here ). R has excellent packages for analyzing stock data, so I feel there should be a “translation” of the post for using R for stock data analysis. This post is the first in a two-part series on stock data analysis using R, based on a lecture I gave on the subject for MATH 3900 (Data Science) at the University of Utah . In these posts, I will discuss basics such as obtaining the data from Yahoo! Finance using pandas, visualizing stock data, moving averages, developing a moving-average crossover strategy, backtesting, and benchmarking. The final post will include practice problems. This first post discusses topics up to introducing moving averages. NOTE: The information in this post is of a general nature containing information and opinions from the author’s perspective. None of the content of this post should be considered financial advice. Furthermore, any code written here is provided without any form of guarantee. Individuals who choose to use it do so at their own risk. INTRODUCTION Advanced mathematics and statistics have been present in finance for some time. Prior to the 1980s, banking and finance were well-known for being “boring”; investment banking was distinct from commercial banking and the primary role of the industry was handling “simple” (at least in comparison to today) financial instruments, such as loans. Deregulation under the Regan administration, coupled with an influx of mathematical talent, transformed the industry from the “boring” business of banking to what it is today, and since then, finance has joined the other sciences as a motivation for mathematical research and advancement. For example one of the biggest recent achievements of mathematics was the derivation of the Black-Scholes formula , which facilitated the pricing of stock options (a contract giving the holder the right to purchase or sell a stock at a particular price to the issuer of the option). That said, bad statistical models, including the Black-Scholes formula, hold part of the blame for the 2008 financial crisis . In recent years, computer science has joined advanced mathematics in revolutionizing finance and trading , the practice of buying and selling of financial assets for the purpose of making a profit. In recent years, trading has become dominated by computers; algorithms are responsible for making rapid split-second trading decisions faster than humans could make (so rapidly, the speed at which light travels is a limitation when designing systems ). Additionally, machine learning and data mining techniques are growing in popularity in the financial sector, and likely will continue to do so. In fact, a large part of algorithmic trading is high-frequency trading (HFT) . While algorithms may outperform humans, the technology is still new and playing an increasing role in a famously turbulent, high-stakes arena. 
HFT was responsible for phenomena such as the 2010 flash crash and a 2013 flash crash prompted by a hacked Associated Press tweet about an attack on the White House. My articles, however, will not be about how to crash the stock market with bad mathematical models or trading algorithms. Instead, I intend to provide you with basic tools for handling and analyzing stock market data with R. We will be using stock data as a first exposure to time series data , which is data considered dependent on the time it was observed (other examples of time series include temperature data, demand for energy on a power grid, Internet server load, and many, many others). I will also discuss moving averages, how to construct trading strategies using moving averages, how to formulate exit strategies upon entering a position, and how to evaluate a strategy with backtesting. DISCLAIMER: THIS IS NOT FINANCIAL ADVICE!!! Furthermore, I have ZERO experience as a trader (a lot of this knowledge comes from a one-semester course on stock trading I took at Salt Lake Community College)! This is purely introductory knowledge, not enough to make a living trading stocks. People can and do lose money trading stocks, and you do so at your own risk! GETTING AND VISUALIZING STOCK DATA GETTING DATA FROM YAHOO! FINANCE WITH QUANTMOD Before we analyze stock data, we need to get it into some workable format. Stock data can be obtained from Yahoo! Finance , Google Finance , or a number of other sources, and the quantmod package provides easy access to Yahoo! Finance and Google Finance data, along with other sources. In fact, quantmod provides a number of useful features for financial modelling, and we will be seeing those features throughout these articles. In this lecture, we will get our data from Yahoo! Finance. # Get quantmod if (!require(""quantmod"")) { install.packages(""quantmod"") library(quantmod) } start <- as.Date(""2016-01-01"") end <- as.Date(""2016-10-01"") # Let' Apple's ticker symbol is AAPL. We use the # quantmod function getSymbols, and pass a string as a first argument to # identify the desired ticker symbol, pass 'yahoo' to src for Yahoo! # Finance, and from and to specify date ranges # The default behavior for getSymbols is to load data directly into the # global environment, with the object being named after the loaded ticker # symbol. This feature may become deprecated in the future, but we exploit # it now. getSymbols(""AAPL"", src = ""yahoo"", from = start, to = end) ## As of 0.4-0, 'getSymbols' uses env=parent.frame() and ## auto.assign=TRUE by default. ## ## This behavior will be phased out in 0.5-0 when the call will ## default to use auto.assign=FALSE. getOption(""getSymbols.env"") and ## getOptions(""getSymbols.auto.assign"") are now checked for alternate defaults ## ## This message is shown once per session and may be disabled by setting ## options(""getSymbols.warning4.0""=FALSE). See ?getSymbols for more details. ## [1] ""AAPL"" # What is AAPL? 
class(AAPL) ## [1] ""xts"" ""zoo"" # Let's see the first few rows head(AAPL) ## AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume ## 2016-01-04 102.61 105.37 102.00 105.35 67649400 ## 2016-01-05 105.75 105.85 102.41 102.71 55791000 ## 2016-01-06 100.56 102.37 99.87 100.70 68457400 ## 2016-01-07 98.68 100.13 96.43 96.45 81094400 ## 2016-01-08 98.55 99.11 96.76 96.96 70798000 ## 2016-01-11 98.97 99.06 97.34 98.53 49739400 ## AAPL.Adjusted ## 2016-01-04 102.61218 ## 2016-01-05 100.04079 ## 2016-01-06 98.08303 ## 2016-01-07 93.94347 ## 2016-01-08 94.44022 ## 2016-01-11 95.96942 Let’s briefly discuss this. getSymbols() created in the global environment an object called AAPL (named automatically after the ticker symbol of the security retrieved) that is of the xts class (which is also a zoo -class object). xts objects (provided in the xts package) are seen as improved versions of the ts object for storing time series data. They allow for time-based indexing and provide custom attributes, along with allowing multiple (presumably related) time series with the same time index to be stored in the same object. (Here is a vignette describing xts objects.) The different series are the columns of the object, with the name of the associated security (here, AAPL) being prefixed to the corresponding series. Yahoo! Finance provides six series with each security. Open is the price of the stock at the beginning of the trading day (it need not be the closing price of the previous trading day), high is the highest price of the stock on that trading day, low the lowest price of the stock on that trading day, and close the price of the stock at closing time. Volume indicates how many stocks were traded. Adjusted close (abreviated as “adjusted” by getSymbols() ) is the closing price of the stock that adjusts the price of the stock for corporate actions. While stock prices are considered to be set mostly by traders, stock splits (when the company makes each extant stock worth two and halves the price) and dividends (payout of company profits per share) also affect the price of a stock and should be accounted for. VISUALIZING STOCK DATA Now that we have stock data we would like to visualize it. I first use base R plotting to visualize the series. plot(AAPL[, ""AAPL.Close""], main = ""AAPL"") A linechart is fine, but there are at least four variables involved for each date (open, high, low, and close), and we would like to have some visual way to see all four variables that does not require plotting four separate lines. Financial data is often plotted with a Japanese candlestick plot , so named because it was first created by 18th century Japanese rice traders. Use the function candleChart() from quantmod to create such a chart. candleChart(AAPL, up.col = ""black"", dn.col = ""red"", theme = ""white"") With a candlestick chart, a black candlestick indicates a day where the closing price was higher than the open (a gain), while a red candlestick indicates a day where the open was higher than the close (a loss). The wicks indicate the high and the low, and the body the open and close (hue is used to determine which end of the body is the open and which the close). Candlestick charts are popular in finance and some strategies in technical analysis use them to make trading decisions, depending on the shape, color, and position of the candles. I will not cover such strategies today. (Notice that the volume is tracked as a bar chart on the lower pane as well, with the same colors as the corresponding candlesticks. 
Some traders like to see how many shares are being traded; this can be important in trading.) We may wish to plot multiple financial instruments together; we may want to compare stocks, compare them to the market, or look at other securities such as exchange-traded funds (ETFs) . Later, we will also want to see how to plot a financial instrument against some indicator, like a moving average. For this you would rather use a line chart than a candlestick chart. (How would you plot multiple candlestick charts on top of one another without cluttering the chart?) Below, I get stock data for some other tech companies and plot their adjusted close together. # Let's get data for Microsoft (MSFT) and Google (GOOG) (actually, Google is # held by a holding company called Alphabet, Inc., which is the company # traded on the exchange and uses the ticker symbol GOOG). getSymbols(c(""MSFT"", ""GOOG""), src = ""yahoo"", from = start, to = end) ## [1] ""MSFT"" ""GOOG"" # Create an xts object (xts is loaded with quantmod) that contains closing # prices for AAPL, MSFT, and GOOG stocks <- as.xts(data.frame(AAPL = AAPL[, ""AAPL.Close""], MSFT = MSFT[, ""MSFT.Close""], GOOG = GOOG[, ""GOOG.Close""])) head(stocks) ## AAPL.Close MSFT.Close GOOG.Close ## 2016-01-04 105.35 54.80 741.84 ## 2016-01-05 102.71 55.05 742.58 ## 2016-01-06 100.70 54.05 743.62 ## 2016-01-07 96.45 52.17 726.39 ## 2016-01-08 96.96 52.33 714.47 ## 2016-01-11 98.53 52.30 716.03 # Create a plot showing all series as lines; must use as.zoo to use the zoo # method for plot, which allows for multiple series to be plotted on same # plot plot(as.zoo(stocks), screens = 1, lty = 1:3, xlab = ""Date"", ylab = ""Price"") legend(""right"", c(""AAPL"", ""MSFT"", ""GOOG""), lty = 1:3, cex = 0.5) What’s wrong with this chart? While absolute price is important (pricey stocks are difficult to purchase, which affects not only their volatility but your ability to trade that stock), when trading, we are more concerned about the relative change of an asset rather than its absolute price. Google’s stocks are much more expensive than Apple’s or Microsoft’s, and this difference makes Apple’s and Microsoft’s stocks appear much less volatile than they truly are (that is, their price appears to not deviate much). One solution would be to use two different scales when plotting the data; one scale will be used by Apple and Microsoft stocks, and the other by Google. plot(as.zoo(stocks[, c(""AAPL.Close"", ""MSFT.Close"")]), screens = 1, lty = 1:2, xlab = ""Date"", ylab = ""Price"") par(new = TRUE) plot(as.zoo(stocks[, ""GOOG.Close""]), screens = 1, lty = 3, xaxt = ""n"", yaxt = ""n"", xlab = """", ylab = """") axis(4) mtext(""Price"", side = 4, line = 3) legend(""topleft"", c(""AAPL (left)"", ""MSFT (left)"", ""GOOG""), lty = 1:3, cex = 0.5) Not only is this solution difficult to implement well, it is seen as a bad visualization method; it can lead to confusion and misinterpretation, and cannot be read easily. A “better” solution, though, would be to plot the information we actually want: the stock’s returns. This involves transforming the data into something more useful for our purposes. There are multiple transformations we could apply. One transformation would be to consider the stock’s return since the beginning of the period of interest. In other words, we plot: This will require transforming the data in the stocks object, which I do next. # Get me my beloved pipe operator! 
if (!require(""magrittr"")) { install.packages(""magrittr"") library(magrittr) } ## Loading required package: magrittr stock_return % t %>% as.xts head(stock_return) ## AAPL.Close MSFT.Close GOOG.Close ## 2016-01-04 1.0000000 1.0000000 1.0000000 ## 2016-01-05 0.9749407 1.0045620 1.0009975 ## 2016-01-06 0.9558614 0.9863139 1.0023994 ## 2016-01-07 0.9155197 0.9520073 0.9791734 ## 2016-01-08 0.9203607 0.9549271 0.9631052 ## 2016-01-11 0.9352634 0.9543796 0.9652081 plot(as.zoo(stock_return), screens = 1, lty = 1:3, xlab = ""Date"", ylab = ""Return"") legend(""topleft"", c(""AAPL"", ""MSFT"", ""GOOG""), lty = 1:3, cex = 0.5) This is a much more useful plot. We can now see how profitable each stock was since the beginning of the period. Furthermore, we see that these stocks are highly correlated; they generally move in the same direction, a fact that was difficult to see in the other charts. Alternatively, we could plot the change of each stock per day. One way to do so would be to plot the percentage increase of a stock when comparing day to day , with the formula: But change could be thought of differently as: These formulas are not the same and can lead to differing conclusions, but there is another way to model the growth of a stock: with log differences. (Here, is the natural log, and our definition does not depend as strongly on whether we use or .) The advantage of using log differences is that this difference can be interpreted as the percentage change in a stock but does not depend on the denominator of a fraction. We can obtain and plot the log differences of the data in stocks as follows: stock_change % log %>% diff head(stock_change) ## AAPL.Close MSFT.Close GOOG.Close ## 2016-01-04 NA NA NA ## 2016-01-05 -0.025378648 0.0045516693 0.000997009 ## 2016-01-06 -0.019763704 -0.0183323194 0.001399513 ## 2016-01-07 -0.043121062 -0.0354019469 -0.023443064 ## 2016-01-08 0.005273804 0.0030622799 -0.016546113 ## 2016-01-11 0.016062548 -0.0005735067 0.002181138 plot(as.zoo(stock_change), screens = 1, lty = 1:3, xlab = ""Date"", ylab = ""Log Difference"") legend(""topleft"", c(""AAPL"", ""MSFT"", ""GOOG""), lty = 1:3, cex = 0.5) Which transformation do you prefer? Looking at returns since the beginning of the period make the overall trend of the securities in question much more apparent. Changes between days, though, are what more advanced methods actually consider when modelling the behavior of a stock. so they should not be ignored. MOVING AVERAGES Charts are very useful. In fact, some traders base their strategies almost entirely off charts (these are the “technicians”, since trading strategies based off finding patterns in charts is a part of the trading doctrine known as technical analysis ). Let’s now consider how we can find trends in stocks. A -day moving average is, for a series and a point in time , the average of the past days: that is, if denotes a moving average process, then: Moving averages smooth a series and helps identify trends. The larger is, the less responsive a moving average process is to short-term fluctuations in the series . The idea is that moving average processes help identify trends from “noise”. Fast moving averages have smaller and more closely follow the stock, while slow moving averages have larger , resulting in them responding less to the fluctuations of the stock and being more stable. quantmod allows for easily adding moving averages to charts, via the addSMA() function. 
candleChart(AAPL, up.col = ""black"", dn.col = ""red"", theme = ""white"") addSMA(n = 20) Notice how late the rolling average begins. It cannot be computed until 20 days have passed. This limitation becomes more severe for longer moving averages. Because I would like to be able to compute 200-day moving averages, I’m going to extend out how much AAPL data we have. That said, we will still largely focus on 2016. start = as.Date(""2010-01-01"") getSymbols(c(""AAPL"", ""MSFT"", ""GOOG""), src = ""yahoo"", from = start, to = end) ## [1] ""AAPL"" ""MSFT"" ""GOOG"" # The subset argument allows specifying the date range to view in the chart. # This uses xts style subsetting. Here, I'm using the idiom # 'YYYY-MM-DD/YYYY-MM-DD', where the date on the left-hand side of the / is # the start date, and the date on the right-hand side is the end date. If # either is left blank, either the earliest date or latest date in the # series is used (as appropriate). This method can be used for any xts # object, say, AAPL candleChart(AAPL, up.col = ""black"", dn.col = ""red"", theme = ""white"", subset = ""2016-01-04/"") addSMA(n = 20) You will notice that a moving average is much smoother than the actual stock data. Additionally, it’ a stock needs to be above or below the moving average line in order for the line to change direction. Thus, crossing a moving average signals a possible change in trend, and should draw attention. Traders are usually interested in multiple moving averages, such as the 20-day, 50-day, and 200-day moving averages. It’s easy to examine multiple moving averages at once. candleChart(AAPL, up.col = ""black"", dn.col = ""red"", theme = ""white"", subset = ""2016-01-04/"") addSMA(n = c(20, 50, 200)) The 20-day moving average is the most sensitive to local changes, and the 200-day moving average the least. Here, the 200-day moving average indicates an overall bearish trend: the stock is trending downward over time. The 20-day moving average is at times bearish and at other times bullish , where a positive swing is expected. You can also see that the crossing of moving average lines indicate changes in trend. These crossings are what we can use as trading signals , or indications that a financial security is changing direction and a profitable trade might be made. Visit next week to read about how to design and test a trading strategy using moving averages. # Package/system information sessionInfo() ## R version 3.3.3 (2017-03-06) ## Platform: i686-pc-linux-gnu (32-bit) ## Running under: Ubuntu 15.10 ## ## locale: ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C ## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 ## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 ## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C ## [9] LC_ADDRESS=C LC_TELEPHONE=C ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C ## ## attached base packages: ## [1] methods stats graphics grDevices utils datasets base ## ## other attached packages: ## [1] magrittr_1.5 quantmod_0.4-7 TTR_0.23-1 xts_0.9-7 ## [5] zoo_1.7-14 RWordPress_0.2-3 optparse_1.3.2 knitr_1.15.1 ## ## loaded via a namespace (and not attached): ## [1] lattice_0.20-34 XML_3.98-1.5 bitops_1.0-6 grid_3.3.3 ## [5] formatR_1.4 evaluate_0.10 highr_0.6 stringi_1.1.3 ## [9] getopt_1.20.0 tools_3.3.3 stringr_1.2.0 RCurl_1.95-4.8 ## [13] XMLRPC_0.3-0 AdvertisementsSHARE THIS: * Twitter * Facebook * Email * Reddit * More * * Print * LinkedIn * * Google * Tumblr * * Pinterest * Pocket * * Telegram * WhatsApp * * Skype * LIKE THIS: Like Loading... 
","This post is the first in a two-part series on stock data analysis using R. In these posts, basics such as obtaining the data from Yahoo! Finance using pandas, visualizing stock data, moving averages, developing a moving-average crossover strategy, backtesting, and benchmarking will be covered.",An Introduction to Stock Market Data Analysis with R (Part 1),Live,162 445,"COMPOSE NOTES: PORTAL POWERUPS AND DELETED DEPLOYMENTS Published Nov 1, 2016 There's now the option to power up more Compose Portals and we've made a recently introduced feature, Deleted Deployments, easier to work with - in this Compose Notes, we'll tell you all about them: DELETED DEPLOYMENTS When we introduced the ability to see and recover your deleted deployments from backup, we added all your deleted deployments in a list underneath your existing deployments. We didn't realise how much of a distraction that could be so we've made a small change and hidden them by default. Now, if you look in your Compose console, at the bottom of your deployments list you'll find something like this: This line tells you simply how many backups of previously deleted deployments are available to be restored. Clicking on the Show button will open up the list so it will appear like this: Now you can select any deleted deployment backup and recover it. It makes bringing your deleted databases back from the dead easier and less distracting in day to day use. PORTAL POWERUPS We're working on something special when it comes to Compose access portals, and as part of making sure the foundations are solidly all in place, there's a little-big change happening in the Compose console - the ability to add more portals. A quick refresher, for those who don't know - each Compose database deployment runs on its own private virtual encrypted network.
The only way traffic gets in or out is through one of our access portals and there's generic portals for TCP connections and SSH tunnels and more specialized variants that know a bit more about the database they are working with, like the Mongo Router. There's a number of reasons for doing this, which we talk about in the recent article, Do you know why Compose proxies database connections . The thing is, up until now, we've pretty much set in stone how many portals you can have, apart from the SSH tunnel. Well, that's what we're changing; you can now add as many portals as you think you may need. We're still enforcing minimum numbers of portals so you won't go below the number you need for high availability failover, but if you want an extra TCP portal or Mongo Router for your deployment then you can have one. Or two. Just so you know, each extra portal is $4.50 a month. To get your extra portals, just visit the Security tab for your database in the Compose console. Under each class of portal you'll see an Add button for that class - Add TCP Portal , Add SSH Portal , Add Mongo Router and so on. Click on the button and you get an extra portal. These extra portals will be identified in the Overview of the deployment's connection strings too. They are also identified in the portal list. Each portal also now displays its short name, external DNS name which you can use in connection strings you create and the internal IP address for the portal on the private network so you can identify connections from that portal in the logs. If a portal isn't needed any more, then click the Remove button next to it in the list - only portals which aren't part of the required quorum of portals get the Remove button. As we said, this is the foundation for some exciting new features. Currently, you can only have three of any portal class on a deployment, but that's one of the things we are working on. We'll let you know more about that, and the other things, when we are ready to unveil them, but rest assured they'll give you more control of your databases while letting you get on with developing your apps and running your business. --------------------------------------------------------------------------------","There's now the option to power up more Compose Portals and we've made a recently introduced feature, Deleted Deployments, easier to work with - in this Compose Notes, we'll tell you all about them.",Portal Powerups and Deleted Deployments,Live,163 446,"☰ * Login * Sign Up * Learning Paths * Courses * Our Courses * Partner Courses * Badges * Our Badges * BDU Badge Program * BLOG Welcome to the BDUBlog .SUBCRIBE VIA FEED RSS - Posts RSS - Comments SUBSCRIBE VIA EMAIL Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address RECENT POSTS * Learn TensorFlow and Deep Learning Together and Now! * This Week in Data Science (March 14, 2017) * This Week in Data Science (March 7, 2017) * This Week in Data Science (February 28, 2017) * This Week in Data Science (February 21, 2017) CONNECT ON FACEBOOK Connect on FacebookFOLLOW US ON TWITTER My TweetsBLOGROLL * RBloggers LEARN TENSORFLOW AND DEEP LEARNING TOGETHER AND NOW! Posted on March 20, 2017 by Saeed Aghabozorgi I get a lot of questions about how to learn TensorFlow and Deep Learning. I’ll often hear, “How do I start learning TensorFlow?” or “How do I start learning Deep Learning?”. My answer is, “Learn Deep Learning and TensorFlow at the same time!”. See, it’s not easy to learn one without the other. 
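Even a five-minute first experiment teaches the two together: a couple of lines of TensorFlow already introduce the tensors, operations, and sessions that every deep learning model discussed in this post is built from. A minimal sketch, assuming the TensorFlow 1.x graph-and-session API that this article describes:

import tensorflow as tf

# A tiny graph: an elementwise product and a sum, the same weighted-sum
# computation a single neuron performs
x = tf.constant([1.0, 2.0, 3.0])
w = tf.constant([0.5, 0.5, 0.5])
y = tf.reduce_sum(x * w)

# Defining the graph computes nothing; launching a Session actually runs it
with tf.Session() as sess:
    print(sess.run(y))  # 3.0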
Of course, you can use other libraries like Keras or Theano, but TensorFlow is a clear favorite when it comes to libraries for deep learning. And now is the best time to start. If you haven’t noticed, there’s a huge wave of new startups or big companies adopting deep learning. Deep Learning is the hottest skill to have right now. So let’s start from the basics. What actually is “Deep Learning” and why is it so hot in data science right now? What’s the difference between Deep Learning and traditional machine learning? Why TensorFlow? And where can you start learning? WHAT IS DEEP LEARNING? Inspired by the brain, deep learning is a type of machine learning that uses neural networks to model high-level abstractions in data. The major difference between Deep Learning and Neural Networks is that Deep Learning has multiple hidden layers, which allows deep learning models (or deep neural networks) to extract complex patterns from data. HOW IS DEEP LEARNING DIFFERENT FROM TRADITIONAL MACHINE LEARNING ALGORITHMS, SUCH AS NEURAL NETWORKS? Under the umbrella of Artificial Intelligence (AI), machine learning is a sub-field of algorithms that can learn on their own , including Decision Trees, Linear Regression, K-means clustering, Neural Networks, and so on. Deep Neural Networks, in particular, are super-powered Neural Networks that contain several hidden layers. With the right configuration/hyper-parameters, deep learning can achieve impressively accurate results compared to shallow Neural Networks with the same computational power. WHY IS DEEP LEARNING SUCH A HOT TOPIC IN THE DATA SCIENCE COMMUNITY? Simply put, across many domains, deep learning can attain much faster and more accurate results than ever before , such as image classification, object recognition, sequence modeling, speech recognition, as so on. It all started recently, too; around 2015. There were three key catalysts that came together resulting in the popularity of deep learning: 1. Big Data : the presence of extremely large and complex datasets; 2. GPUs : the low cost and wide availability of GPUs made the parallel processing faster and cheaper than ever; 3. Advances in deep learning algorithms , especially for complex pattern recognition. These three factors resulted in the deep learning boom that we see today. Self-driving cars and drones, chat bots, translations, AI playing games. You can now see a tremendous surge in the demand for data scientists and cognitive developers. Big companies are recognizing this evolution in data-driven insights, which is why you now see IBM, Google, Apple, Tesla, and Microsoft investing a lot of money in deep learning. WHAT ARE THE APPLICATIONS OF DEEP LEARNING? Historically, the goal of machine learning was to move humanity towards the singularity of “General Artificial Intelligence ”. But not surprisingly, this goal has been tremendously difficult to attain. So instead of trying to develop generalized AI, scientists started to develop a series of models and algorithms that excelled in specific tasks. So, to realize the main applications of Deep Learning, it is better to briefly take a look at each of the different types of Deep Neural Networks, their main applications, and how they work. WHAT ARE THE DIFFERENT TYPES OF DEEP NEURAL NETWORKS? CONVOLUTIONAL NEURAL NETWORKS (CNNS) Assume that you have a dataset of images of cats and dogs, and you want to build the model that can recognize and differentiate them. Traditionally, your first step would be “feature selection”. 
That is, to choose the best features from your images, and then use those features in a classification algorithm (e.g., Logistic Regression or Decision Tree), resulting in a model that could predict “cat” or “dog” given an image. These chosen features could simply be the color, object edges, pixel location, or countless other features that could be extracted from the images. Of course, the better and effective the feature sets you found, the more accurate and efficient image classification you could obtain. In fact, in the last two decades, there has been a lot of scientific research in image processing just about how one can find the best feature sets from images for the purposes of classification. However, as you can imagine, the process of selecting and using the best features is a tremendously time-consuming task and is often ineffective. Further, extending the features to other types of images becomes an even greater problem – the features you used to discriminate cats and dogs cannot be generalized, for example, for recognizing hand-written digits. Therefore, the importance of feature selection can’t be overstated. Enter convolutional neural networks (CNNs). Suddenly, without having to find or select features, CNNs finds the best features for you automatically and effectively. So instead of you choosing what image features to classify dogs vs. cats, CNNs can automatically find those features and classify the images for you. Convolutional Neural Network (Wikipedia) WHAT ARE THE CNN APPLICATIONS? CNNs have gained a lot of attention in the machine learning community over the last few years. This is due to the wide range of applications where CNNs excel, especially machine vision projects: image recognition/classifications , object detection/recognition in images , digit recognition , coloring black and white images , translation of text on the images , and creating art images , Lets look closer to a simple problem to see how CNNs work. Consider the digit recognition problem. We would like to classify images of handwritten numbers, where the target will be the digit (0,1,2,3,4,5,6,7,8,9) and the observations are the intensity and relative position of pixels. After some training, it’s possible to generate a “function” that map inputs (the digit image) to desired outputs (the type of digit). The only problem is how well this map operation occurs. While trying to generate this “function”, the training process continues until the model achieves a desired level of accuracy on the training data. You can learn more about this problem and the solution for it through our convolution network with hands-on notebooks . HOW DOES IT WORK? Convolutional neural networks (CNNs) is a type of feed-forward neural network , consist of multiple layers of neurons that have learnable weights and biases. Each neuron in a layer that receives some input, process it, and optionally follows it with a non-linearity. The network has multiple layers such as convolution, max pool, drop out and fully connected layers. In each layer, small neurons process portions of the input image. The outputs of these collections are then tiled so that their input regions overlap, to obtain a higher-resolution representation of the original image; and it is repeated for every such layer. The important point here is: CNNs are able to break the complex patterns down into a series of simpler patterns, through multiple layers. RECURRENT NEURAL NETWORK (RNN) Recurrent Neural Network tries to solve the problem of modeling the temporal data. 
You feed the network with the sequential data, it maintains the context of data and learns the patterns in the temporal data. WHAT ARE THE APPLICATIONS OF RNN? Yes, you can use it to model time-series data such as weather data, stocks, or sequential data such as genes. But you can also do other projects, for example, for text processing tasks like sentiment analysis and parsing. More generally, for any language model that operates at word or character level. Here are some interesting projects done by RNNs: speech recognition , adding sounds to silent movies , Translation of Text , chat bot , hand writing generation , language modeling (automatic text generation) , and Image Captioning . HOW DOES IT WORK? The Recurrent Neural Network is a specialized type of Neural Network that solves the issue of maintaining context for sequential data . RNNs are models with a simple structure and a feedback mechanism built-in. The output of a layer is added to the next input and fed back to the same layer. At each iterative step, the processing unit takes in an input and the current state of the network and produces an output and a new state that is re-fed into the network . However, this model has some problems . It’s very computationally expensive to maintain the state for large amounts of units, even more so over a long amount of time. Additionally, Recurrent Networks are very sensitive to changes in their parameters. To solve these problems, a way to keep information over long periods of time and additionally solve the oversensitivity to parameter changes, i.e., make backpropagating through the Recurrent Networks more viable was found. What is it? Long-Short Term Memory (LSTM). LSTM is an abstraction of how computer memory works: you have a linear unit, which is the information cell itself, surrounded by three logistic gates responsible for maintaining the data. One gate is for inputting data into the information cell, one is for outputting data from the input cell, and the last one is to keep or forget data depending on the needs of the network. If you want to practice the basic of RNN/LSTM with TensorFlow or language modeling, you can practice it here . RESTRICTED BOLTZMANN MACHINE (RBM) RBMs are used to find the patterns in data in an unsupervised fashion. They are shallow neural nets that learn to reconstruct data by themselves. They are very important models, because they can automatically extract meaningful features from a given input, without the need to label them. RBMs might not be outstanding if you look at them as independent networks, but they are significant as building blocks of other networks, such as Deep Believe Networks. WHAT ARE THE APPLICATIONS OF RBM? RBM is useful for unsupervised tasks such as feature extraction/learning, dimensionality reduction, pattern recognition, recommender systems ( Collaborative Filtering ), classification, regression, and topic modeling. To understand the theory of RBM and application of RBM in Recommender Systems you can run these notebooks . HOW DOES IT WORK? It only possesses two layers: a visible input layer and a hidden layer where the features are learned. Simply put, RBM takes the inputs and translates them into a set of numbers that represents them. Then, these numbers can be translated back to reconstruct the inputs. Through several forward and backward passes, the RBM will be trained. 
Now we have a trained RBM model that can reveal two things: first, what is the interrelationship among the input features; second, which features are the most important ones when detecting patterns. DEEP BELIEF NETWORKS (DBN) Deep Belief Network is an advanced Multi-Layer Perceptron (MLP). It was invented to solve an old problem in traditional artificial neural networks. Which problem? The backpropagation in traditional Neural Networks can often lead to “local minima” or “vanishing gradients”. This is when your “error surface” contains multiple grooves and you fall into a groove that is not the lowest possible groove as you perform gradient descent. WHAT ARE THE APPLICATIONS OF DBN? DBN is generally used for classification (same as traditional MLPs). One the most important applications of DBN is image recognition. The important part here is that DBN is a very accurate discriminative classifier and we don’t need a big set of labeled data to train DBN; a small set works fine because feature extraction is unsupervised by a stack of RBMs. HOW DOES IT WORK? DBN is similar to MLP in term of architecture, but different in training approach. DBNs can be divided into two major parts. The first one is stacks of RBMs to pre-train our network. The second one is a feed-forward backpropagation network, that will further refine the results from the RBM stack. In the training process, each RBM learns the entire input. Then, the stacked RBMs, can detect inherent patterns in inputs.DBN solves the “vanishing problem” by using this extra step, so-called DBN solves the “vanishing problem” by using this extra step, so-called pre-training . Pre-training is done before backpropagation and can lead to an error rate not far from optimal. This puts us in the “neighborhood” of the final solution. Then we use backpropagation to slowly reduce the error rate from there. AUTOENCODER An autoencoder is an artificial neural network employed to recreate a given input. It takes a set of unlabeled inputs, encodes them and then tries to extract the most valuable information from them. They are used for feature extraction, learning generative models of data, dimensionality reduction and can be used for compression. They are very similar to RBMs but can have more than 2 layers. WHAT ARE THE APPLICATIONS OF AUTOENCODERS? Autoencoders are employed in some of the largest deep learning applications, especially for unsupervised tasks. For example, for Feature Extraction , Pattern recognition, and Dimensionality Reduction . In another example, say that you want to extract what feeling the person in a photography is feeling , Nikhil Buduma explains the utility of this type of Neural Network with excellence. HOW DOES IT WORK? RBM is an example of Autoencoders, but with fewer layers. An autoencoder can be divided into two parts: the encoder and the decoder . Let’s say that we want to classify some facial images and each image is very high dimensionally (e.g 50×40). The encoder needs to compress the representation of the input. In this case we are going to compress the face of our person, that consists of 2000 dimensional data to only 30 dimensions, taking some steps between this compression. The decoder is a reflection of the encoder network. It works to recreate the input, as closely as possible. It has an important role during training, to force the autoencoder to select the most important features in the compressed representation. After training, you can use 30 dimensions to apply your algorithms. WHY TENSORFLOW? HOW DOES IT WORK? 
TensorFlow is also just a library but an excellent one. I believe that TensorFlow’s capability to execute the code on different devices, such as CPUs and GPUs, is its superpower. This is a consequence of its specific structure. TensorFlow defines computations as graphs and these are made with operations (also know as “ops”). So, when we work with TensorFlow, it is the same as defining a series of operations in a Graph. To execute these operations as computations, we must launch the Graph into a Session. The session translates and passes the operations represented in the graphs to the device you want to execute them on, be it a GPU or CPU. For example, the image below represents a graph in TensorFlow. W , x, and b are tensors over the edges of this graph. MatMul is an operation over the tensors W and x , after that Add is called and add the result of the previous operator with b . The resultant tensors of each operation cross the next one until the end, where it’s possible to get the wanted result. TensorFlow is really an extremely versatile library that was originally created for tasks that require heavy numerical computations. For this reason, TensorFlow is a great library for the problem of machine learning and deep neural networks. WHERE SHOULD I START LEARNING? Again, as I mentioned first, it does not matter where to start, but I strongly suggest that you learn TensorFlow and Deep Learning together. Deep Learning with TensorFlow is a course that we created to put them together. Check it out and please let us know what you think of it. Good luck on your journey into one of the most exciting technologies to surface in our field over the past few years. SHARE THIS: * Facebook * Twitter * LinkedIn * Google * Pocket * Reddit * Email * Print * RELATED Tags: data science , Deep Learning , Deep Neural Networks , TensorFlow -------------------------------------------------------------------------------- COMMENTS LEAVE A REPLY CANCEL REPLY * About * Contact * Blog * Events * Ambassador Program * Resources * FAQ * Legal Follow us * * * * * * * Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.","In this article, we discuss various Deep Learning approaches and recommend you a way to learn TensorFlow and Deep Learning at the same time.",Learn TensorFlow and Deep Learning Together and Now!,Live,164 447,"Compose The Compose logo Articles Sign in Free 30-day trialCLASSCRAFT - MAKING THE MOST OF COMPOSE Published Apr 24, 2017 case study mongodb elasticsearch Classcraft - Making the Most of ComposeClasscraft gamifies the whole classroom experience, making education a fun adventure for both students and teachers. We chatted with Shawn Young, ex-teacher, programmer, and founder of Classcraft Studios about their teaching platform built on Meteor.js and their use of Compose for MongoDB and Elasticsearch. “It’s boring.” That’s the usual answer most parents get when they ask their children about school. Shawn Young, a former 11th grade teacher has seen this firsthand in his career. According to Shawn, “There’s a crisis in education right now. We’ve started solving a lot of the logistical problems with technology like parent communication, homework, and distributing resources. 
But now we’re realizing that, as a market, all these great tools aren’t actually enough to get students excited about coming to school.” Worse, a recent Gallup study found out that this disengagement increases as students progress causing dropouts. At home, the students have a richer digital interactive experience through the internet, social media and games. At school this experience is missing. Shawn wanted to solve this engagement problem. In 2013, he created a basic online role-playing game that would make the classroom participation more engaging for his students. A former student posted the game on Reddit. Within a week, it went to the front page of Reddit Gaming. And suddenly Shawn started to get inquiries from thousands of teachers about the game. Realizing he had got something here, Shawn teamed up with his father Lauren, a 35-year business veteran, and brother Devin, a creative director in New York, to start Classcraft Studios. The first beta of Classcraft was launched in January 2014 and then an open version of the product was made available in August of the same year. The original app was built on PHP. But soon they moved to Node.js and Meteor.js because of the scaling and speed they needed for a real-time game. The entire monolithic single server app was hosted on Amazon Web Services (AWS) and deployed using Capistrano, a Ruby tool, for compiling and deploying a Meteor instance. It all worked fine for a year, but as Classcraft exploded in popularity they started to see some hiccups with the architecture. They had only one server, but needed the ability to run multiple instances of the app. Node wasn’t designed for it. They used Passenger (also a Ruby tool) to overcome this Node limitation. NGINX was used to direct people to the right process. It temporarily made things better, but then they started running into memory leak issues that would impact everyone on the server. 'Hot patches' were deployed to fix things in the app, but these forced restarts, which required all users to connect to the database at the same time. As a result, the database started to crash. The obvious solution was to scale vertically by adding memory to the servers. So, they switched to the top tier of the AWS services. But soon it became clear that they also needed horizontal scaling – a challenge because documentation on how to do this with Meteor was sparse at the time. Fortunately, Meteor came out with Galaxy, a Docker-based solution for hosting Meteor apps that would enable horizontal scaling through containers. Upgrading from MongoDB 2.6 to 3.2 also helped mitigate some of the performance issues. But all these changes came at a cost. During peak times, Shawn found himself spending 20 hours a week on sysadmin tasks for just to keep the app running. During one of their upgrades, Shawn said, ”I stayed up all night, and I hadn’t completed the migration – I thought I would do it fast the next day, but basically it was becoming a huge time sink. My senior developer and I were running on fumes. The end of that first night I started looking into other solutions.” That’s when Classcraft decided to move to Compose. It coincided well, because it was right after Compose had implemented the WiredTiger storage engine as an option. Classcraft could use it to migrate their entire platform very easily. As the product evolved, the team developed features requiring advanced search capabilities. 
While MongoDB is great at many things, the complex types of location-based and fuzzy searches they needed weren't ones that MongoDB supports well. Thus, Classcraft turned to Compose for Elasticsearch. “Part of what’s cool is that basically you don’t have to provision an entire Elasticsearch setup yourself. I can just press a button [on the Compose console] and then I know, for our use, it’s probably the best way that it should be set up.” Another feature they liked about Compose was the ability to assign user permissions and roles. “It’s actually pretty cool to be able to give developers selective access. For awhile we would restrict access to the database on the old stack, because you could just go in there and write queries and erase all the users, right? You don’t want to let anybody do that.” With Compose, Classcraft was able to select the right tools for the app they were building without all the administrative overhead. “Looking back, I am very happy that we moved to Compose,” Shawn said. “Basically, Compose took the hassle of database management off of our hands so we could focus on what’s most important to us - our product. And I didn’t have to do anything. It’s pretty great!” So how is Classcraft doing these days? “Fantastic!” according to Shawn. They just passed 2.1 million users. The app is available in 75 countries in 10 languages. People are flooding social media with great testimonials and feedback. Classcraft's success is getting noticed in academia too who are publishing papers on their achievements. And then there are schools where kids are dressed up in armor and doing giant dance battles against one another - as part of the game exercise! Occasionally they would get a testimonial from a teacher that says, 'This has completely changed my classroom. It’s the best thing that’s happened to me in my entire career.' “It’s all very humbling”, said Shawn. “Thanks to Compose for having our back as we set to make an impact in the educational sector.” To learn more about Classcraft Studios and their platform, visit: https://www.classcraft.com/ . -------------------------------------------------------------------------------- If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com . We're happy to hear from you. Image by: Classcraft Arick Disilva works in Product Marketing at Compose. Love this article? Head over to Arick Disilva ’s author page and keep reading.RELATED ARTICLES Jun 16, 2015MYSTRO MODERNIZES MASSAGE THERAPY We're always excited to hear from our customers on how they're using Compose. Sheree Evans, a co-founder at Mystro, emailed o… Jon Silvers Mar 15, 2017USE ALL THE DATABASES – PART 2 Loren Sands-Ramshaw, author of GraphQL: The New REST shows how to combine data from multiple data sources using GraphQL in p… Guest Author Mar 2, 2017USE ALL THE DATABASES - PART 1 Loren Sands-Ramshaw, author of GraphQL: The New REST, shows how to combine data from multiple sources using GraphQL in this W… Guest Author Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Customer Stories Compose Webinars Support System Status Support & Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ ScyllaDB MySQL Deployments AWS SoftLayer Google Cloud Services Enterprise Add-ons © 2017 Compose, an IBM Company","Classcraft gamifies the whole classroom experience, making education a fun adventure for both students and teachers. 
We chatted with Shawn Young, ex-teacher, programmer, and founder of Classcraft Studios about their teaching platform built on Meteor.js and their use of Compose for MongoDB and Elasticsearch.",Making the Most of Compose – Customer: Classcraft,Live,165 448,"Compose The Compose logo Articles Sign in Free 30-day trialPUSH NOTIFICATIONS WITH MONGODB Published Jul 18, 2017 mongodb push notifications firebase Push Notifications With MongoDBPush notifications are a staple of mobile and Internet of Things applications, and in this Write Stuff contribution Don Omondi, Founder and CTO of Campus Discounts , demonstrates how to leverage Compose MongoDB to send more effective push notifications. Today’s technology has seen a sharp rise in connected devices, popularly known as the Internet of Things (IoT). Applications now live in watches, shoes and, perhaps rather oddly, in salt shakers too! The IoT surge has also posed a few challenges for developers, one of them being how to send notifications to the plethora of connected devices. The main problem arises from the fact that different devices have different ways of subscribing to, receiving, and unsubscribing from notifications. We’ll see how to tackle this problem using MongoDB but first a little background information. WHAT ARE PUSH NOTIFICATIONS? A push notification is a message that is ""pushed"" from a backend server or application to a user interface such as mobile applications and desktop applications. A lot of developers make use of a notification service to send push notifications. A notification service provides a means to push notifications to many devices at once and may include other features such as delivery reports and analytics. PUSH NOTIFICATIONS WITH FIREBASE CLOUD MESSAGING So with the preliminaries out of the way, let’s see how we can integrate Firebase Cloud Messaging (FCM), Google’s free notification service into a MongoDB powered backend. Through FCM we can send notifications to any service worker enabled browser (Chrome, Firefox, and Opera with Edge coming soon) as well as native Android & IOS applications. To push with FCM, all we need to do is create an FCM app which will give us a server key. Thereafter, using either the Web, Android or iOS SDK generate an FCM client token once the user grants permissions (A practical example coming a bit later). If you google around, you might be surprised to find that there are a number of people who’ve had a bit of trouble finding the GCM settings. You’ll have to click the settings icon/cog wheel next to your project name at the top of the Firebase console, then click on Project settings, and finally select the Cloud Messaging tab. Armed with a server key and client token pair, sending a push notification is performed by a simple POST request to the FCM endpoint with an authorization header containing the key and a JSON encoded body of the notification with the client token in the ""to"" field like: https://fcm.googleapis.com/fcm/send Content-Type: application/json Authorization: key=AIzaSyC...akjgSX0e4 { ""notification"": { ""title"": ""Message Title"", ""body"": ""Message body"", ""click_action"" : ""https://dummypage.com"" }, ""to"" : ""eEz-Q2sG8nQ:APA91bHJQRT0JJ..."" } The POST response will respond indicating whether the push notification was sent successfully or failed. 
{ ""multicast_id"": 7986976529786388478, ""success"": 1, ""failure"": 0, ""canonical_ids"": 0, ""results"": [{ ""message_id"": ""0:1496965028924567%e609af1cf9fd7ecd"" }] } That is really all it takes to send push notifications, but for many real life applications, it mustn’t stop there. It’s important to note that users don’t subscribe to push notifications but devices do, so we’ll have to find a way to link a client token to a user. This means saving the data in a store somewhere. We may also be interested in granting a user a subscription management interface as well as logging notifications. Let’s see why MongoDB is a good fit for this data store. SOME REASONS TO USE MONGODB FOR PUSH NOTIFICATIONS Storing Device Metadata: Many times you’d want to store some metadata about the device that has subscribed to push notifications, such as the browser vendor and version, or the toaster serial number or perhaps the salt shaker color. With nearly an infinite number of connectable devices, you’d really want a schemaless database for this. Handling Shared Devices: A lot of people share devices, whether publicly like when using a cyber-café or privately when browsing on a friend's laptop, tablet or phone. They might not unsubscribe from notifications which the notification service provider will continue to happily deliver. We can mitigate this by setting a time to live (TTL) that automatically removes subscriptions that are not renewed within a given time. MongoDB has us covered here, too. One User Many Devices: With the increase in connectable devices, it’s now common for one user to own many devices that use your application. For performance reasons, embedding a list of devices in one document per user would ensure maximum efficiency in many use cases. Logging: You may also be interested in getting an overview of the recently pushed notifications. This can be useful for example to delete subscriptions that repeatedly fail to be delivered. MongoDB’s capped collection would be a perfect fit for this use case. A PRACTICAL EXAMPLE: BLOG Let’s say we have a blog and want to subscribe users to receive push notifications for example on new posts, comments or likes. Our blog will store each subscription in a MongoDB document using a sample schema below. { ""_id"": ObjectId, ""token"": String, ""subscribed_on"": Date, ""user_id"": Integer, ""fingerprint"": String, ""details"": [ ""browser"" : String, ""os"" : String, ""osVersion"" : String, ""device"" : String, ""deviceType"" : String, ""deviceVendor"" : String, ""cpu"" : String ] } We already know we need to store two fields in the FCM, client token and the user_id . We'll also want to know the time a user subscribed to receive a push, which we'll store in the subscribed_on field. Furthermore, to help a user manage their subscriptions, we’ll need to store a device’s information like the operating system and version, browser vendor, and others. This way you can help a user associate a FCM notification endpoint to a device. We created a details array field to store such arbitrary data. 28th June, 2017 via Chrome on Android 5.1 Finally, let’s assume we also want to reduce the number of duplicate subscriptions, which can happen when people share devices or when the notification service generates a new subscription Universally Unique Identifier (UUID) for the same device. Duplicate data is bad because it can lead to different notifications being sent to the same device but for different users. 
So we’ll need to create a field to store a value that can fairly accurately identify a device, to achieve this, we’ll use a technique called device fingerprinting. A device fingerprint , also sometimes called a machine fingerprint or browser fingerprint is information collected about a remote computing device for the purpose of identification. Fingerprints can be used to fully or partially identify individual users or devices even when cookies are turned off. With the document schema ready, we’ll need to create a MongoDB collection to hold them, let’s create one called ‘push_notifications’ from mongo shell > db.createCollection(‘push_notifications’) From MongoDB 3.2 and beyond we can enforce some level of strict schema by using document validation. You can read more about it as well as find some examples in Document Validation in MongoDB By Example . In our example, we want to ensure that every subscription document has a non-null subscribed_on field with a data type Date . We also need a non-null device fingerprint value of type string and a non-null user_id value of type Int . Let’s enforce it with this validation. > db.createCollection( ""push_notifications"", { validator: { $and: [ { token: { $type: ""string"" } }, { token: { $exists: true } }, { subscribed_on: { $type: ""date"" } }, { subscribed_on: { $exists: true } }, { user_id: { $type: ""int"" } }, { user_id: { $exists: true } }, { fingerprint: { $type: ""string"" } }, { fingerprint: { $exists: true } } ] } } ) For speedy lookups on documents matching certain client tokens, let’s create an index on the token field. We can also declare this index as unique so as to prevent duplicate subscriptions from different users using the same token. db.push_notifications.createIndex( { ""token"": 1 }, { unique: true} ) Since we wish to have subscriptions automatically removed after a certain amount of time, let’s create a TTL index to tell MongoDB to delete documents after some time. db.push_notifications.createIndex( { ""subscribed_on"": 1 }, { expireAfterSeconds: 604800 } ) This will purge documents whose subscribed_on field’s value is greater than or equal to 1 week from the time MongoDB runs its background checks. To keep active subscriptions, update the subscribed_on field periodically, for example, once a day, so that they are always less than 1 week old. To enable us to quickly look up all subscriptions for a specific user, let’s create an index on the user_id field. db.push_notifications.createIndex({ user_id: 1 }) If your app allows non-logged in users to subscribe to push, then you can make this a partial index from the MongoDB shell as follows. db.push_notifications.createIndex({ user_id: 1 } , { partialFilterExpression: { user_id: { $exists: true } } }) Also, don’t forget to remove the user_id requirements from the document validation. For MongoDB versions prior to 3.2, use sparse indexes instead. That’s it, with the database schema all set up, our backend is ready to push. It’s now up to our frontend to send information so our backend can know to whom. For that, we’ll use some JavaScript. The procedure is to first ask the user for permissions, then register the device for push notifications and pass its UUID, fingerprint as well as some details to our backend. SETTING UP THE JS CLIENT For getting the browser fingerprint on the client side, we can use the conveniently named library clientjs . 
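Before wiring up that client code, it is worth sketching what the backend might do when a client posts its token, fingerprint and device details. This is a minimal mongo-shell-style sketch, not from the original article, and the token, user id and fingerprint values are placeholders. Upserting on the token, which carries the unique index, means a repeat subscription from the same device simply refreshes subscribed_on, so the TTL index keeps active devices alive and quietly expires abandoned ones.

db.push_notifications.update(
  { token: 'eEz-Q2sG8nQ:APA91bHJQRT0JJ...' },   // the FCM client token (placeholder)
  {
    $set: {
      subscribed_on: new Date(),                 // refresh the TTL clock
      user_id: NumberInt(42),                    // the validator expects an int
      fingerprint: 'c2c8a2b5e0f14c3d',           // placeholder fingerprint string
      details: [ { browser: 'Chrome', os: 'Android', osVersion: '5.1' } ]
    }
  },
  { upsert: true }
)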
clientjs also allows us to get device specific information such as the OS, OS version, CPU type, Device Type and more which we’ll use to fill our details array. We’ll also need to create a small service worker script called firebase-messagingsw.js which is also responsible for enabling background pushes. // Give the service worker access to Firebase Messaging. // Note that you can only use Firebase Messaging here, other Firebase libraries // are not available in the service worker. importScripts('https://www.gstatic.com/firebasejs/4.1.2/firebase-app.js'); importScripts('https://www.gstatic.com/firebasejs/4.1.2/firebase-messaging.js'); // Initialize the Firebase app in the service worker by passing in the // messagingSenderId. firebase.initializeApp({ 'messagingSenderId': prev.add(cur(""revenue""))) In the command above, we're first referencing our table ""revenue_by_month"" and then selecting the documents from it in order by month - exactly as we showed in the section above to retrieve all the documents in the table. In our case, we could just as easily order by id since our documents are in order that way as well, but the example demonstrates that you could put the documents in any order you choose based on one or more fields. The fold command will process the input according to the order you specify. Next, we're calling fold and passing in a base value of 0. If our fiscal year for some reason included December 2014 and we knew that value, we could pass that in as the base instead. For our purposes here, we want to just get the cumulative revenue for 2015 so we want to start at 0. In the next part of the command we're naming two variables, ""prev"" for the first input to the process and ""cur"" for the current row's input. Finally, we're telling fold what we want it to output. In this case, we're telling it to add the ""prev"" value to the ""cur"" value using the ""revenue"" field. What happens is this: as the first step, our base value of 0 comes in as ""prev"". That gets added to the January revenue, the ""cur"" value of 89750. We now have 89750. It becomes the new ""prev"" and we add to it the February revenue value of 100327. We now have 190007. It becomes the new ""prev""... and so on. From this, we'll get the sum of the 12 months of revenue values: 1469221 So, why not just use the sum aggregation? Yes, of course we could, but if we were doing something other than add where the order of the processing of the values was important to the outcome or where being able to pass in a base value that was not part of the dataset was required, then you can see how fold sets itself apart from reduce and concatMap . For our example, the beauty of what we want to achieve - where those benefits become apparent - actually comes with the emit function of fold . We'll look at that next. EMIT emit outputs an array where each element represents one step in the process. Let's look at that function: r.table(""revenue_by_month"").orderBy(""month"") .fold(0, (prev, cur) => prev.add(cur(""revenue"")), {emit: (prev, cur, ytd) => [ytd]}) Now we've called the emit function and we've specified three variables (it requires three so even if you don't want all three, you still need to specify them). For us, that's the ""prev"" and ""cur"" we used for processing the sequence and now we've added in a variable for the outcome of each of those variables being added together called ""ytd"". For our emit we're not doing anything much fancy. 
We're simply outputting the ""ytd"" array, which will show us the value for each step of the process. In the API reference, you'll see the examples use the branch function which applies an ""if... then... else..."" logic to the output, but for our example, our outcome does not require any advanced logic to be applied. Here's what emit returns to us for the ""ytd"" array: element | value ----------------- 1 | 89750 2 | 190077 3 | 286786 4 | 399621 5 | 525407 6 | 649701 7 | 776435 8 | 902376 9 | 1048599 10 | 1207724 11 | 1330831 12 | 1469221 As you can see, we get the year-to-date total for each month. So, at the end of April, the year-to-date revenue was 399621. Ah, yes... that's what fold can do for us in this example! NEXT STEPS With the ""ytd"" array, then, as an output of fold , we could choose to perform any other functions on it by applying a then and a do . The array is now open for additional processing. That's outside the scope of this article, however, but we hope we've made the fold command a little less esoteric for you in this article. If you deploy RethinkDB this month (July 2016), you can get a limited edition t-shirt ! Image by: Michael Gaida Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Lisa Smith - keepin' it simple. Love this article? Head over to Lisa Smith’s author page and keep reading. Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Database as a Service Customer Stories Support System Status Support Enhanced Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ Enterprise Add-ons Deployments AWS DigitalOcean SoftLayer© 2016 Compose","Earlier this year, RethinkDB released their Fantasia 2.3 version. In this article, we're going to take a closer look at one of the lesser-known features that came out with that release - the aggregation fold command.",Deeper into RethinkDB 2.3,Live,171 469,"Skip to main content IBM developerWorks / Developer Centers Sign In | Register Cloud Data Services * Services * How-Tos * Blog * Events * ConnectSIMPLE DATA PIPE CONNECTORSptitzler / April 7, 2016Earlier this year, we released a new version of the Simple Data Pipe application . This app lets you load data from the source of your choice directly intoCloudant. You just create a data pipe configuration and run it.Here, the Simple Data Pipe app loaded 26 case records from Salesforce by runningthe pipe configuration salesforce_case .The Simple Data Pipe app is a framework to create, modify, delete, and run datapipe configurations. When an app user chooses a pipe configuration (like source : Saleforce , and dataset : case ) and runs it, the Simple Data Pipe framework invokes a data-source-specificconnector (in this case, the Salesforce connector) to perform the actual datamovement. The connector interprets the configuration and moves the appropriatedata into Cloudant.A data pipe configuration contains information about the data source,authentication information, and source data set information. A pipeconfiguration depends upon the connector and the choices a user makes.Simple Data Pipe loads a configuration-specific connector from the GitHub repothat contains the connector implementation.Connectors handle data movement from the cloud data source to Cloudant, by: 1. connecting to the source using OAuth (if secure access is required) 2. retrieving the requested data sets 3. optionally enriching them with data from other sources, and 4. 
storing the results as JSON documents in Cloudant for later processingConnectors copy data from the source into Cloudant databases.The Simple Data Pipe app ships with built-in connectors for Salesforce and Stripe . Additional connectors are available and you can deploy them as add-ons, providing access to a variety of datasources, like Reddit, Slack, and Trello.We developed the connectors that exist so far to facilitate our own dataanalysis projects. As part of this work, we updated the Simple Data Pipe to makeit easier to build new custom connectors for other popular data sources.CLOUD DATA SOURCE AUTHENTICATIONConnectors can now take advantage of the popular Passport authentication middleware for Node.js to establish secure connectivity withdata sources. This eliminates the need to manually implement the entire OAuthauthentication flow. Take a look, for example, at the Slack connector . To implement authentication, we * added the passport-slack strategy as a module dependency, * configured the strategy, and * specified the OAuth scopes required by the Slack API calls (fetch list of channels and fetch messages in channel) we intended to use.With hundreds of strategies to choose from, chances are good that there's onefor the data source you need. If there isn't one yet, why not implement ityourself and publish on GitHub ?JUMPSTARTING CONNECTOR DEVELOPMENT USING BOILERPLATESTo make it even easier for you to get started, we created a couple connectorboilerplates for popular cloud data sources. These boilerplates haveauthentication support baked-in, which lets you focus on what's important: dataretrieval. Check out our connectors page to see the list.DATA RETRIEVAL AND ENRICHMENTThe Simple Data Pipe framework does not impose any restrictions on how to fetch,manipulate, and optionally enrich data. Browsing through our catalog, you'll seethat some connectors use vendor-provided API libraries (like stripe.com ), some use third-party API wrapper libraries (like this lightweight one for slack ), and some call the REST API endpoints directly via HTTP(S) requests.DATA STORAGE AND OUTPUT FORMATWhen you start a data pipe run, the Simple Data Pipe app automatically creates adedicated Cloudant database for each data set the connector processes. Atruntime, the Simple Data Pipe framework provides the connector with a callbackto be invoked whenever individual records or sets of records need to be writtento the Cloudant database. There are no constraints as to what structure recordshave to use—you can pick whatever makes the most sense in the context of how thedata will be consumed. For example, our connector for Reddit flattens data structures to support processing by the Spark-Cloudant connector , whereas others preserve the data structures returned by the API.DATA ENRICHMENTSimple Data Pipe connectors can simply load data from the cloud data source(like the salesforce.com connector). Or they can be smarter and combine fetcheddata with information obtained from other cloud data sources to provide avalue-added service, as in the following two examples: * The social media connector for Reddit uses Watson Tone Analyzer to gauge the tone of user comments. A complete use-case scenario based on this data is nicely illustrated in Chetna's blog post . * The connector for flightstats.com combines flight status information with weather data for the departure airport.TRY ITWhat do you think? 
Does the Simple Data Pipe sound like something that would streamline some of your projects? Try it out: Deploy the Simple Data Pipe app , and load data with a built-in connector. Next, deploy an add-on connector . If we've won you over by then, go whole-hog and create a custom connector of your own! Let us know how it goes. We'd love to hear from you and collaborate on GitHub.","Import from the cloud data source of your choice. How connectors let you load data from a variety of sources, through the Simple Data Pipe, and into Cloudant.",Simple Data Pipe Connectors,Live,172 472,"Formulated.by, Formulating Digital and Face-To-Face Experiences, Oct 31 -------------------------------------------------------------------------------- 10 MUST ATTEND DATA SCIENCE, ML AND AI CONFERENCES IN 2018 The keynote stage at Strata Data Conference in London (source: O’Reilly Conferences via Flickr 2017) Technology is advancing and new methods of designing effective business-progressing tools are emerging. Data Science conferences are not only about discovering the latest trends in the field, but also about building connections and networks of people who will fulfill your career and personal goals. Advancing is about continuous learning and we believe that the best way of doing this is by joining passionate people willing to share their work insights and innovations. Read about and check out our top ten 2018 Data Science, Machine Learning and Artificial Intelligence conferences. Enjoy! 1. KDD KDD (Knowledge Discovery and Data Mining) is an annual international conference focusing on science and engineering topics. It brings together researchers and practitioners from data science, data mining, knowledge discovery, large-scale data analytics, and big data. In 2018, the conference will take place in London from 19 to 23 August. 2. DATA SCIENCE SALON Focusing on media and entertainment in Los Angeles (December 14th), where the conference was born, and on finance and technology in its Miami spin-off, the Data Science Salon is a destination conference which brings together specialists face-to-face to educate each other, illuminate best practices, and innovate new solutions in a casual atmosphere.
Data Science Salon is a one-two day conference including workshops targeting executives, senior data scientists, developers, and business development professionals alike. After 2 successful encounters in 2017, Data Science Salon will come to Miami on 8–9 of February discussing Artificial Intelligence and Machine Learning in the fields of finance and health technology. 3. STRATA Strata covers a wide range of topics varying from Machine Learning, Data Engineering and Architecture, Big Data to Visualization, Cybersecurity, Law, Ethics and Data-Driven Business Management. Data case studies are shared by experts around the world together with their best practices, effective new analytic approaches, and exceptional skills. Strata targets an audience of data scientists, analysts, and executives. In 2018, three sessions are scheduled in San Jose on 5–8 March, in London on 21–24 May and New York on 11–14 September . 4. NIPS In 2018, the annual conference of Neural Information Processing Systems (NIPS) will meet for its thirty-second time. It is a single-track machine learning and computational neuroscience conference that includes invited talks, demonstrations and oral and poster presentations of refereed papers. It will take place in Montréal, Canada on 3–8 December. 5. AI CONFERENCE Exploring the most essential issues and innovations in Applied AI, the Artificial Intelligence Conference targets a wide range of topics on the subject. From AI impact on business and society to implementing AI projects and its models and methods, this conference brings the growing AI community together to share executive briefings, case studies and industry-specific applications. In 2018, there will be four sessions, the first one being held in Mandarin Chinese in Beijing on 10–13 April , followed by New York on 29 April — 2 May , San Francisco on 4–7 September and London on 8–11 October. 6. DATA SCIENCE POP-UP The Data Science Pop-up is a day long conference which brings together data science managers who are passionate about asking the right questions and identifying problems worth solving. Share ideas, develop best practices, and network with others in the field. The main focus is on presenting real stories about the cutting edge work being done today. Catch the last data science pop-up of 2017 in Chicago on November 14th . In 2018 the conference will be held in New York in February, San Francisco in May and in London in October. 7. ODSC EAST & WEST The Open Data Science conference is about accelerating your data science knowledge, training, and network. The event speakers include some of the core contributors to many open source tools, libraries, and languages. Topics discussed include the latest AI & data science topics, tools, and languages from some of the best and brightest minds in the field. In 2018, the ODSC East will take place in Boston on 1–4 of May and ODSC West will meet in San Frinciso, California. 8. MLCONF MLconf gathers communities to discuss the recent research and application of Algorithms, Tools, and Platforms to solve the hard problems that exist within organizing and analyzing massive and noisy data sets. MLconf events host speakers from various industries, research and universities.Each event is a single-track, single-day event, composed of 14–16 presentations around 25 min each. The goal of this format is for attendees to take home practical tips and methods to apply in their own work; as well as cited papers, code samples and work to reference for their own research. Date and location TBA. 
9. ENTERPRISE DATA WORLD The Enterprise Data World (EDW) Conference will meet for its 22nd time in San Diego , California on 22–27 April. EDW is unique in being considered the most comprehensive educational conference on data management in the world. The six-day conference consists of in-depth tutorials, hundreds of hours of presentations on educational material and two-day workshops. Topics discussed by distinguished speakers include Data Governance and Stewardship, Data Architecture, Modeling, Metadata Management, NoSQL and Database Technologies, Data and Information Quality, Business Intelligence, Analytics, Data Science, Big Data and Enterprise Information Management, and much more. 10. RSTUDIOCONF RStudio conference is about all things R and RStudio. In 2018 more optional Training Days workshops for people newer to R and for advanced users and administrators will be added. Three conference tracks will be available; one focusing on the fundamentals of data science with R, another for more experienced RStudio users on advanced capabilities, R in “production” and interoperability, and a third one on solutions to interesting problems. Next year, the conference will be held in San Diego, California on 1–3rd of February. * Data Science * Machine Learning * AI * Deep Learning * Big Data 3 Blocked Unblock Follow FollowingFORMULATED.BY Formulating Digital and Face-To-Face Experiences * 3 * * * Never miss a story from Formulated.by , when you sign up for Medium. Learn more Never miss a story from Formulated.by Blocked Unblock Follow Get updates","Technology is advancing and new methods of designing effective business-progressing tools are emerging. Data Science conferences are not only about discovering the latest trends in the field, but…","10 Must Attend Data Science, ML and AI Conferences in 2018",Live,173 474,"INTRODUCING THE SIMPLE SEARCH SERVICE Glynn Bird / January 21, 2016 Turning your spreadsheet or mysql.dump into a faceted search engine just got alot easier. Try out our new Simple Search Service, built to help you create andmanage a useful polished search engine for your own site or app.I’ve blogged before about turning spreadsheet data into a faceted search engine . That tutorial has a few basic steps: 1. sign up for an IBM Cloudant NoSQL database account 2. use couchimport to import your spreadsheet data into Cloudant 3. instruct Cloudant to index the data using a Design Document 4. perform a Cloudant Search queryIf you’re familiar with NoSQL databases and Cloudant or Apache CouchDB inparticular, you should find those steps relatively easy to follow. But forsomeone new to NoSQL, there’s a lot to learn in there before hitting the searchAPI: JSON, command-line tools, design documents, and Lucene query syntax to namejust a few.The Cloud Data Services Developer Advocacy team is always looking to make thingsas easy as possible. To that end, we are today unveiling the Simple SearchService, which greatly simplifies the steps to turning your tabular data into afaceted search engine.To try it out, visit the Simple Search Service repository on Github and click the Deploy To Bluemix button. This will install the code in your IBM Bluemix account, connect theservices it needs and give you a simple web front-end that lets you import andindex your spreadsheet data. 
(Bluemix has a free trial, so it won’t cost youanything to try out Simple Search Service in the first month.)WHAT IS THE SIMPLE SEARCH SERVICE?Simple Search Service is a Node.js app that you can get and use immediately bydeploying to the IBM Bluemix platform-as-a-service with a couple of mouse clicks. Deployment gets you yourown working instance of the app, automatically provisions a Cloudant account,attaches it to the service, and presents a web app that lets you upload a datafile. When you upload data, it’s automatically imported into Cloudant, withevery field indexed for search.Simple Search Service then exposes a RESTful search API that your applicationcan use. The API is CORS-enabled, so your client-side web app can use it withoutissue. The API is also cached, meaning that it stores popular searches in anin-memory data store for faster retrieval, giving your application betterperformance.UPLOADING DATAThe Simple Search Service home page invites you to upload your CSV(comma-separated file) or TSV file (tab-separated file):Uploading a CSV or TSV is easySimple Search Service expects the first line of the file to contain the columnheadings like this:transaction_id description price customer_name date 42 Pet food 24.22 Jones 2015-04-02 43 Cake 9.99 Smith 2015-04-02File format must be comma or tab-separated and filenames must end in either .csv or .tsv .Simple Search Service will accept the following data types: * strings * numbers * booleans * arrays of strings (separated by commas)Records like this:person_id first_name last_name score passed tags 1 Glynn Bird 45.3 true uk,tall,glasses 2 Mike Broberg 24.1 false us,short,funnywould be turned into the following JSON documents:{ ""person_id"": ""Glynn"", ""last_name"": ""Bird"", ""score"": 45.3, ""passed"": true, ""tags"": [""uk"", ""tall"", ""glasses""]}{ ""person_id"": ""Mike"", ""last_name"": ""Broberg"", ""score"": 24.1, ""passed"": false, ""tags"": [""us"", ""short"", ""funny""]}The values within the score and passed fields are not wrapped in quotation marks. That’s because they’re not strings,they’re numbers and boolean values. Simple Search Service will, in most cases,detect the data types by examining the first few lines of the file but alsogives you the opportunity to override.At this point you may also choose which fields you would like to be “faceted”,by ticking the facet box next to each field:Specify facets on fields upon data importChoose fields you’d use to group your data. Faceting counts the occurrences ofeach field value in a result set. This gives someone searching your data aninsight into the composition of the dataset at a glance. The fields you want tofacet are usually ones where the values tend to repeat frequently, like these: * category names * tags * enumerationsYou can see an example of faceted search results in the guitarsexample app for the tutorial I mentioned at the start of this article. Thefaceted fields (type, range, brand, country, year) appear to the right of theresult set and have been programmed to act as secondary filters within thesearch results.What makes a good facet?SIMPLE SEARCH SERVICE APIThe Simple Search Service API is a simplified version of the Cloudant Search API. With Simple Search Service, there are only two parameters: * q – the query you wish to perform (default = : ) * cache – whether to cache search results (default = true)The API is expecting GET requests to /search e.g. /search?q=brand:fender . 
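Because the API is CORS-enabled, a client-side call can be as simple as the sketch below. This is not from the original tutorial: the host name is a placeholder for wherever your own instance is deployed, and the facet field names will match whatever you ticked during import. The response contains a rows array with the matching documents, a total_rows count and a counts object with the facet totals.

// Placeholder URL - use the route Bluemix assigned to your deployment
var SEARCH_URL = 'https://my-simple-search-service.mybluemix.net/search';

fetch(SEARCH_URL + '?q=' + encodeURIComponent('brand:fender'))
  .then(function (response) { return response.json(); })
  .then(function (data) {
    console.log(data.total_rows + ' matches');
    data.rows.forEach(function (row) {
      console.log(row);            // each matching document
    });
    console.log(data.counts);      // facet counts for the faceted fields
  })
  .catch(function (err) { console.error(err); });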
Here are some example queries: * q=*:* – return everything * q=brand:fender – a field search looking for a specific value of the field ‘brand’ * q=brand:fender OR brand:gibson – a more complicated fielded search with an ‘OR’ clause * q=blonde+fender+telecaster – a simple, free-text searchUnder the hood, Simple Search Service adds additional parameters to ensure thatthe document body is returned, that counts of faceted fields are returned, andthat the returned JSON is simplified.Simple Search Service automatically caches all search results for an hour. Youcan override this behaviour by adding a cache=false parameter to each Simple Search Service API search request.USING REDIS AS A CACHEBy default, Simple Search Service uses an in-memory hash table to cache commonsearch results. This is fine for testing, but if you are going to multipleSimple Search Service nodes then it makes sense to have a centralised cache. Redis is an in-memory database and can be easily integrated into a Simple SearchService installation. To do so: 1. Sign up for an account at compose.io 2. Create a Redis cluster and make a note of URL and password of your cluster 3. In Bluemix, add a Redis by Compose service, ensuring that you name it Redis by Compose — with no appended characters 4. Configure your Bluemix Redis service with the URL and password from your Compose.io accountAdd a centralised cache with Redis by Compose to scale up your deploymentWhen Simple Search Service reboots, it will detect the attached Redis serviceand use that for its caching layer.TRY ITSee for yourself. Visit the Simple Search Service repository on Github to preview the code, or click the Deploy To Bluemix button below. After it deploys, click the View your app button and upload your data. Happy searching!SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM PrivacyIBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! 
Email check failed, please try again Sorry, your blog cannot share posts by email.","Quickly create a faceted search API for use in your own apps with open source code for Cloudant & Redis, from IBM's CDS developer advocacy team.",Introducing the Simple Search Service: Faceted search API made easy,Live,174 476,"Nikole Mcleish Blocked Unblock Follow Following Jul 28 -------------------------------------------------------------------------------- GENERATING POEMS: A WAY WITH WORDS AND CODE HOW I BUILT A POEM GENERATOR APP USING WATSON APIS Editor’s note: This article marks the first in an occasional series by the 2017 summer interns on the Watson Data Platform developer advocacy team, depicting projects they developed using Bluemix data services, Watson APIs, the IBM Data Science Experience, and more.Some people have a way with words. Others have a way with code. If you’re using my Poem Generator project , you don’t necessarily need either. The application creates poems based on user input. Using the Watson Tone Analyzer service , the user’s feelings are scored on a scale of 0 to 1. The Poem Generator then uses these feelings to craft a poem. An example poem from my Poem Generator application.HOW WAS IT BUILT? The Poem Generator is a Flask web application. Flask is a Python microframework for web development. The application uses Watson Natural Language Understanding , in addition to the Tone Analyzer service, both available on the IBM Bluemix platform. Tone Analyzer analyzes the emotional sentiment of a text. Natural Language Understanding can extract topics from text with keywords and entities. ""document_tone"": { ""tone_categories"": [ { ""category_id"": ""emotion_tone"", ""tones"": [ { ""tone_name"": ""Anger"", ""score"": 0.000105, ""tone_id"": ""anger"" }, { ""tone_name"": ""Disgust"", ""score"": 0.001659, ""tone_id"": ""disgust"" }, { ""tone_name"": ""Fear"", ""score"": 0.026971, ""tone_id"": ""fear"" }, { ""tone_name"": ""Joy"", ""score"": 0.066884, ""tone_id"": ""joy"" }, { ""tone_name"": ""Sadness"", ""score"": 0.946133, ""tone_id"": ""sadness"" } ], ""category_name"": ""Emotion Tone"" }, ... ] } In addition, the application uses a PostgreSQL database and offers in-app database management. In Bluemix, you can create a PostgreSQL database using either Compose for PostgreSQL or with ElephantSQL . HOW DOES IT WORK? ADDING LINES The Poem Generator keeps a database of lines and their emotional content, if any. When a user enters a line, the application calls the Tone Analyzer service and receives scores on the emotional content of that line. If a score for a particular emotion is high enough, that emotion will be marked. The Poem Generator is bootstrapped with a database of scored lines from various verses.Lines that do not score high enough for any emotions are marked as fillers. These lines have no distinct emotional content, allowing them to be placed in any poem without affecting the tone. Alternately, the application allows users to import and export multiple lines as well. GENERATING POEMS Users receive the input prompt, “ How are you feeling?” If there is emotional content in their input, the generator will gather all lines that register any emotion. The application then randomly selects lines to craft a 5-line poem. The generator will also randomly determine and select a filler line for lines 1, 3 and 5. WHAT OTHER FEATURES DOES IT HAVE? When generating poems, users can toggle certain features. The first is Word Replacement . 
This feature uses Watson Natural Language Understanding to extract keywords from the user’s input. Then this service will analyze the generated poem for keywords. If both contain keywords, then a random keyword from each will be swapped. I added this feature to allow more personalization to the poems. Note the various options along the bottom of the UI that you can toggle on and off.Other features include options to change how the application selects emotions. The Dominant Emotion feature will select the highest-scoring emotion and create a poem based on that. The Shared Emotions feature will select lines that match the range of emotions flagged in the user’s input. This feature requires lines in the database that can simultaneously satisfy multiple emotions. Lines that have more than one prominent emotion, however, are more difficult to find, causing this feature to rarely generate successful poems. The last feature for generating poems is No Fillers . Fillers are lines that do not exhibit strong emotions. When users select this feature, the generator will not add filler lines to the poem. DATABASE MANAGEMENT The Poem Generator uses a PostgreSQL database. When using the application for the first time, it creates a table if one does not already exist in the database. Eight columns comprise this table: id , line , anger , disgust , fear , joy , sadness , and filler . The database prevents duplicates by requiring unique lines. When creating a table for the first time, a collection of modern lines will be automatically added. You can view all the lines in the database via the app’s UI. These lines can be added, modified and re-scored, or deleted in the application. There are also options to export all lines in your database to a CSV file. You can do a bulk import of lines using a CSV too. Additionally, you can delete all lines from the table in your PostgreSQL database. CURRENT LIMITATIONS AND FUTURE IMPROVEMENTS People could potentially use the Poem Generator to create stories or songs. One limitation, however, was the context of a line’s location in a particular poem. When analyzing the tones of a line of poetry, the emotional content does not always match the emotions of the overall poem. Thus, a poem containing lines marked as sad may exhibit other emotions, like fear or anger. Looking toward future improvements, this generator could use machine learning to determine which lines are better together. If users could rank or like certain poems, the overall emotional content, line placement, and lines included could be used to create better poems. It was my first time working with the Watson APIs, and I hope it serves as a useful example to others just getting started as well. Give it a try. How are you feeling today? Filler line: If you enjoyed this article, please ♡ it to recommend it to other Medium readers. Thanks to Mike Broberg and G. Adam Cox . * Cognitive Computing * Ibm Bluemix * Ibm Watson * Python Flask * Postgres Blocked Unblock Follow FollowingNIKOLE MCLEISH FollowIBM WATSON DATA LAB The things you can make with data, on the IBM Watson Data Platform. * Share * * * * Never miss a story from IBM Watson Data Lab , when you sign up for Medium. Learn more Never miss a story from IBM Watson Data Lab Get updates Get updates","Some people have a way with words. Others have a way with code. If you’re using my Poem Generator project, you don’t necessarily need either. The application creates poems based on user input. 
Using…",A Way with Words and Code – IBM Watson Data Lab – Medium,Live,175 478,"Jump to navigation * Twitter * LinkedIn * Facebook * About * Contact * Content By Type * Blogs * Videos * All Videos * IBM Big Data In A Minute * Video Chats * Analytics Video Chats * Big Data Bytes * Big Data Developers Streaming Meetups * Cyber Beat Live * Podcasts * White Papers & Reports * Infographics & Animations * Presentations * Galleries * Subscribe ×BLOGS TOP ANALYTICS TOOLS IN 2016 Post Comment June 10, 2016 by Gaurav Vohra CEO & Co-Founder, Jigsaw Academy, The Online School of Analytics Follow me on LinkedIn , TwitterData analysis is not cut and dried, providing results in absolute terms. Rather, many tools, techniques and processes can help dissect data, structuring it into actionable insights. As we look toward the future of data analytics, we can expect certain trends in tools and technologies to dominate the analytics space: * Data analysis frameworks * Visualization frameworks * Model deployment frameworks DATA ANALYSIS FRAMEWORKS Open-source frameworks such as R, with its increasingly mature ecosystem, and Python, with its pandas and scikit-learn libraries, seem poised to continue their dominance of the analytics space. In particular, certain projects in the Python ecosystem seem ripe for quick adoption: * blaze Modern data scientists work with myriad data sources, ranging from CSV files and SQL databases to Apache Hadoop clusters. The blaze expression engine helps data scientists use a consistent API to work with a full range of data sources, lightening the cognitive load required by use of varied frameworks. * bcolz By providing the ability to do processing on disk rather than in memory, this interesting projects aims to find a middle ground between using Hadoop for cluster processing and using local machines for in-memory computations, thereby providing a ready solution when data size is too small to require a Hadoop cluster but not so small as to be handled within memory. R and Python ecosystems, of course, are only the beginning, for the Apache Spark framework is also seeing rapid adoption—not least because it offers APIs in R as well as in Python. Building on a general trend of using open-source ecosystems, we can also expect to see a move toward distribution-based approaches. Anaconda, for example, offers distributions for both Python and R, and Canopy offers a Python distribution geared toward data science. And no one will be surprised if we see the integration of analytics software such as R or Python in a standard database. Beyond open-source frameworks, a growing body of tools is helping business users interact directly with data while helping them produce guided data analysis. Tools such as IBM Watson, for example, attempt to abstract the data science process away from the user. Although such an approach is still in its infancy, it offers what appears to be a very promising framework for data analysis. VISUALIZATION FRAMEWORKS Visualizations are on the verge of being dominated by the use of web technologies such as JavaScript frameworks. After all, everyone wants to create dynamic visualizations, but not everyone is a web developer—or has the time to spend writing JavaScript code. Understandably, then, certain frameworks have been rapidly gaining in popularity: * plotly Offering APIs in Python, R and Matlab, this data visualization tool has been making a name for itself and seems on track for increasingly broad adoption. 
* bokeh This library may be exclusive to Python, but it also offers a strong potential for rapid future adoption. What’s more, these two examples are only the beginning. We should expect to see JavaScript-based frameworks that offer APIs in R and Python continue to evolve as they see increasing adoption. MODEL DEPLOYMENT FRAMEWORKS Many service providers are willing to replicate the SaaS model on premises, notably the following: * Domino Data Labs * Yhat * Opencpu What’s more, in addition to needing to deploy models, we’re also seeing a growing need to document code. Accordingly, we might expect to see a version control system similar to Github but that is geared toward data science, offering the ability to track different versions of data sets. Going forward, we anticipate that data and analytics tools will see increased implementation in mainstream business processes, and we expect such use to guide organizations toward a data-driven approach to decision making. For now, keep your eye on the foregoing tools—you won’t want to miss seeing how they reshape the world of data. Experience the power of Apache Spark in an integrated development environment for data science . Also, join the data science experience and explore how you can use Spark and R to build your own data science applications .","Join us for a look at what’s on the horizon in data analytics, discovering how a broad array of tools aims to change the way we do—and think about—data science.",Top analytics tools in 2016,Live,176 480,"Glynn Bird / April 27, 2016In earlier blog posts I have described a microservice architecture that uses a queue, pubsub channel, or message hub to broker a list of “work”. Each item of work is a block of data—typically aJSON document—that is to be processed, saved, or acted upon in some way.
Icreated the Metrics Collector Microservice , which collects web metrics data from mobile or web apps and writes it to aqueue or pubsub channel using Redis, RabbitMQ, or Apache Kafka. A separateMetrics Collector Storage Microservice consumes the work and writes it to achoice of Cloudant, MongoDB, or ElasticSearch. I then described how othermicroservices could be added to aggregate the streaming data as it arrived. Fortunately Compose.io allows the deployment of Redis, RabbiMQ,MongoDB, or ElasticSearch and IBM offers Cloudant and Apache Kafka as services,so it’s very easy to get started but there are a lot of moving parts.Today I’ll be using a new service, OpenWhisk , which makes it simple to deploy microservices and eliminates the need tomanage your own message broker or deploy your own worker servers.OpenWhisk is an open-source, event-driven compute platform. You send your action code to OpenWhisk and then deliver a stream of data that your OpenWhisk code worksupon. OpenWhisk handles the scaling out of the computing resources needed todeal with the workload; all you deal with is the action code and the data that triggers the actions. You pay only for the amount ofwork that is undertaken, not for servers standing idle waiting for something tohappen.You can write action code in JavaScript or Swift. This means that web developersand iOS developers can create server-side code in the same language as theirfront-end code.GETTING STARTEDThe code snippets and command-line calls made in this blog post assume that youhave signed up for the OpenWhisk beta programme in Bluemix and have alreadyinstalled the “wsk” command-line tool. Visit https://developer.ibm.com/openwhisk/ and click the Try Now button to get started.HELLO WORLDLet’s create a JavaScript file called ‘hello.js’ containing a function thatreturns a simple object:function main() { return {payload: 'Hello world'};}This is the simplest OpenWhisk action; it simply returns a static string as itspayload. Deploy the action to OpenWhisk with:> wsk action create hello hello.jsok: created action helloThis creates an action called “hello” that runs the code found in hello.js . We can run it in the cloud with:> wsk action invoke --blocking hello{ ""payload"": ""Hello world""}We can also make our code expect parameters:function main(params) { return {payload: 'Hello, ' + params.name + ' from ' + params.place};}Then update our action:> wsk action update hello hello.jsok: created updated helloAnd run our code with parameters:> wsk action invoke --blocking --result hello --param name 'Jennie' --param place 'The Block'{ ""payload"": ""Hello, Jennie from The Block""}We’ve created a simple JavaScript function that processes some data and withoutworrying about queues, workers, or any network infrastructure we were able toexecute the code on the OpenWhisk platform.DOING SOMETHING USEFUL WITH OUR ACTIONSWe can do more complex things in our action, such as making API calls. 
I createdthe following action, which calls out to a Simple Search Service instance containing Game of Thrones data, passing in the q parameter:var request = require('request');function main(msg) { var q = msg.q || 'Jon Snow'; var opts = { method: 'get', url: 'https://sss-got.mybluemix.net/search', qs: { q: q, limit:5 }, json: true } request(opts, function(error, response, body) { whisk.done({msg: body}); }); return whisk.async();}We can create this action and give it a different name:> wsk action create gameofthrones gameofthrones.jsok: created action gameofthronesThen call it with a parameter q ;> wsk action invoke --blocking --result gameofthrones --param q 'melisandre'{ ""msg"": { ""_ts"": 1460028600363, ""bookmark"": ""g2wAAAABaANkAChkYmNvcmVAZGI0LmJtLWRhbC1zdGFuZGFyZDEuY2xvdWRhbnQubmV0bAAAAAJuBAAAAACAbgQA____n2poAkY_7PVPoAAAAGHlag"", ""counts"": { ""culture"": { ""Asshai"": 1 }, ""gender"": { ""Female"": 1 } }, ""from_cache"": true, ""rows"": [ { ""_id"": ""characters:743"", ""_order"": [ 0.9049451947212219, 229 ], ""_rev"": ""1-c68720782e2500311125768153d7170b"", ""aliases"": [ ""The Red Priestess"", ""The Red Woman"", ""The King's Red Shadow"", ""Lady Red"", ""Lot Seven"" ], ""allegiances"": [ """" ], ""books"": [ ""A Clash of Kings"", ""A Storm of Swords"", ""A Feast for Crows"" ], ""born"": ""At\ufffd\ufffdUnknown"", ""culture"": ""Asshai"", ""died"": """", ""father"": """", ""gender"": ""Female"", ""mother"": """", ""name"": ""Melisandre"", ""playedBy"": ""Carice van Houten"", ""povBooks"": ""A Dance with Dragons"", ""spouse"": """", ""titles"": [ """" ], ""tvSeries"": ""Season 2,Season 3,Season 4,Season 5"" } ], ""total_rows"": 1 }}WRITING DATA TO SLACK FROM OPENWHISKAnother task we could perform in an OpenWhisk action is to post a message inSlack. Slack has a great API for creating custom integrations: a Slackadministrator can set up an “incoming webhook”, so posting to a channel is assimple as POSTing a string to an HTTP endpoint. We can create a Slack-postingaction with a few lines of code:var request = require('request');function main(msg) { var text = msg.text || 'This is the body text'; var opts = { method: 'post', url: 'MY_CUSTOM_SLACK_WEBHOOK_URL', form: { payload: JSON.stringify({text:text}) }, json: true } request(opts, function(error, response, body) { whisk.done({msg: body}); }); return whisk.async();}replacing MY_CUSTOM_SLACK_WEBHOOK_URL with the Webhook URL that Slack provided when the “Incoming Webhook”integration was created. Notice how this action is executed asynchronously andonly calls back when the request has completed.Then we can deploy and run it in the usual way:> wsk action create slack slack.jsok: created action slack> wsk action invoke --blocking --result slack --param text 'you know nothing, Jon Snow'{ ""msg"": ""ok""}As it happens, Whisk has a built-in Slack integration , but it’s nice to build things yourself isn’t it? 
Then you can perform your own logic and decide whether a Slack message is posted or not, based on the incoming data.
WRITING DATA TO CLOUDANT FROM OPENWHISK
It is relatively simple to write your own custom action to write to Cloudant because you can:
 * 'require' the Cloudant Node.js library in your JavaScript action
 * write data to Cloudant using its HTTP API
The disadvantage of this approach is that you'd have to hard-code your Cloudant credentials in the action code, just as we hard-coded the Slack Webhook URL in our previous example, which isn't best practice. Fortunately, OpenWhisk has a pre-built Cloudant integration which you can invoke without any custom code. If you have an existing Cloudant account, then you can grant access to that Cloudant service on the command line:

> wsk package bind /whisk.system/cloudant myCloudant -p username 'myusername' -p password 'mypassword' -p host 'mydomainname.cloudant.com'

Then you can see a list of connections that OpenWhisk can interact with:

> wsk package list
packages
/me@uk.ibm.com_dev/myCloudant    private binding

where me@uk.ibm.com is my Bluemix username (or the name of your Bluemix organisation) and dev is your Bluemix space. You can write data to Cloudant by invoking the write command of the package:

> wsk action invoke /me@uk.ibm.com_dev/myCloudant/write --blocking --result --param dbname testdb --param doc '{""name"":""George Friend""}'
{
  ""id"": ""656eaeaed0fd47aa733dd41c3c79a7a0"",
  ""ok"": true,
  ""rev"": ""1-a7720095a32c4d1b994ce5e31fe8c73e""
}

LET'S TAKE A BREATH
So far we've created and updated OpenWhisk actions and triggered individual actions as blocking, command-line tasks. The 'wsk' tool lets you trigger actions to run in the background and also chain actions together into sequences, but we are not going to cover those options in this post. Our code has been simple JavaScript blocks in which the “main” function does the work; there is no need to worry about servers, operating systems, or network hardware. OpenWhisk is an event-driven system. You've seen how to create an event by deploying code manually. But how can we set up OpenWhisk to act upon a stream of events?
OPENWHISK TRIGGERS
A Trigger in OpenWhisk is another way of firing events and executing code. We can create a number of named Triggers and then create rules that define which of our actions (our code) are executed against which of our triggers. Instead of invoking actions directly, we would invoke Triggers instead; the rules defined against the triggers decide which action(s) are executed. This lets us chain actions together, so that one trigger causes several actions to occur, and re-use code by assigning the same action code to multiple triggers. Triggers can fire individually, or tie to external feeds such as:
 * the changes feed from a Cloudant database – every time a document is added, updated, or deleted, a trigger fires
 * the commit feed of a GitHub repository – every time a commit occurs, a trigger fires
So we can use a Cloudant database to fire a trigger which writes some data to Slack:

> wsk trigger create myCloudantTrigger --feed /me@uk.ibm.com_dev/myCloudant/changes --param dbname mydb --param includeDocs true

and configure that trigger to fire our Slack-posting action:

> wsk rule create --enable myRule myCloudantTrigger slack

Now every time a document is added, updated or deleted in the Cloudant database, my custom action fires, which in this case posts a message to Slack!
WHAT WOULD I USE OPENWHISK FOR?
OpenWhisk lends itself to projects where you don't want to manage any infrastructure.
You pay only for the work done, and don't waste money on idle servers. OpenWhisk easily manages peaks of activity, as it scales out to meet the demand. Combining OpenWhisk with other “as-a-Service” databases, such as Cloudant, means that you don't have to manage any data storage infrastructure either. Cloudant is built to store large data sets, cope with high rates of concurrency, and provide high availability. As the cost of spinning up an OpenWhisk action is non-zero, it makes sense to use OpenWhisk for non-trivial computing tasks like:
 * processing an uploaded image to create thumbnails, saving them to object storage
 * taking geo-located data from a mobile application and enriching it with calls out to a Weather API
It is also useful for dealing with systems that feature large amounts of concurrency, such as:
 * mobile apps sending data to the cloud
 * Internet of Things deployments where incoming sensor data needs to be stored and acted upon
There are features of OpenWhisk that I haven't touched on, such as Swift support, the ability to use Docker containers as action code instead of uploading source code, and the mobile software development kit.
REFERENCES
 * OpenWhisk
 * OpenWhisk Source Code
 * OpenWhisk iOS SDK
",OpenWhisk makes it easy to deploy microservices and eliminates the need to manage your own message broker or deploy your own worker servers.,Introducing OpenWhisk: Microservices Made Easy,Live,177 484,"TL;DR: Betteridge's law applies unless your JSON is fairly unchanging and needs to be queried a lot. With the most recent version of PostgreSQL gaining ever more JSON capabilities, we've been asked if PostgreSQL could replace MongoDB as a JSON database. There's a short answer to that, but we'd prefer to show you. Ah, a question from the audience: didn't PostgreSQL already have a JSON data type? Yes, it did. Before PostgreSQL 9.4 there was the JSON data type, and that's still available. It lets you do this:

>CREATE TABLE justjson ( id INTEGER, doc JSON);
>INSERT INTO justjson VALUES ( 1, '{""name"":""fred"",""address"":{""line1"":""52 The Elms"",""line2"":""Elmstreet"",""postcode"":""ES1 1ES""}}');

That stored the raw text of the JSON data in the database, complete with white space, retaining the order of keys and any duplicate keys.
Let's show that by looking at the data:

>SELECT * FROM justjson;
 id |             doc
----+--------------------------------
  1 | {                             +
    |   ""name"":""fred"",              +
    |   ""address"":{                 +
    |     ""line1"":""52 The Elms"",    +
    |     ""line2"":""Elmstreet"",      +
    |     ""postcode"":""ES1 1ES""      +
    |   }                           +
    | }
(1 row)

It has stored an exact copy of the source data. But we can still extract data from it. To do that, there's a set of JSON operators to let us refer to elements within the JSON document. So, say we just want the address section, we can do:

select doc->>'address' FROM justjson;
          ?column?
----------------------------
 {                         +
   ""line1"":""52 The Elms"",  +
   ""line2"":""Elmstreet"",    +
   ""postcode"":""ES1 1ES""    +
 }
(1 row)

The ->> operator says: within doc, look up the JSON object with the following field name and return it as text. With a number, it would have treated it as an array index, but still returned the value as text. There's also -> to go with ->>, which doesn't do that conversion to text. We need that so we can navigate into the JSON objects, like so:

select doc->'address'->>'postcode' FROM justjson;
 ?column?
----------
 ES1 1ES
(1 row)

Though there is a shorter form where we can specify a path to the data we are after using #>> and an array, like this:

select doc#>>'{address,postcode}' FROM justjson;
 ?column?
----------
 ES1 1ES
(1 row)

By preserving the entire document, the JSON data type made it easy to work with exact copies of JSON documents and pass them on without loss. But with that exactness comes a cost, a loss of efficiency, and with that comes an inability to index. So although it's convenient to preserve and parse JSON documents, there was still plenty of room for improvement, and that's where JSONB comes in. With JSONB, the JSON document is turned into a hierarchy of key/value data pairs. All the white space is discarded, only the last value in a set of duplicate keys is used, and the order of keys is lost to the structure dictated by the hashes in which they are stored. If we make a JSONB version of the table we just created, insert some data and look at it:

>CREATE TABLE justjsonb ( id INTEGER, doc JSONB);
>INSERT INTO justjsonb VALUES ( 1, '{""name"":""fred"",""address"":{""line1"":""52 The Elms"",""line2"":""Elmstreet"",""postcode"":""ES1 1ES""}}');
>SELECT * FROM justjsonb;
 id |                                              doc
----+----------------------------------------------------------------------------------------------------
  1 | {""name"": ""fred"", ""address"": {""line1"": ""52 The Elms"", ""line2"": ""Elmstreet"", ""postcode"": ""ES1 1ES""}}
(1 row)

We can see that all the textyness of the data has gone away, replaced with the bare minimum required to represent the data held within the JSON document. This stripping down of data means the JSONB representation moves the parsing work to when the data is inserted, but relieves any later access to the data of the task of parsing it. Looked at as key/value pairs, the JSONB data type does look a bit like the PostgreSQL HSTORE extension. That's a data type for storing key/value pairs, but it is an extension, whereas JSONB (and JSON) are in the core; and HSTORE is only one level deep in terms of data structure, whereas JSON documents can have nested elements. Also, HSTORE stores only strings, while JSONB understands strings and the full range of JSON numbers. Indexing, indexing everywhere. You can't actually index a JSON data type in PostgreSQL. You can make an index for it using expression indexes, but that'll only cover you for whatever you can put in an expression. So, if we wanted to, we could do:

create index justjson_postcode on justjson ((doc->'address'->>'postcode'));

And the postcode, and nothing else, would be indexed. With JSONB, there's support for GIN indexes; a Generalized Inverted Index. That gives you another set of query operators to work with.
These are: @> (contains JSON), <@ (is contained by), ? (does this string exist as a key), ?| (do any of these strings exist as keys) and ?& (do all of these strings exist as keys). There are two kinds of GIN index you can create: the default one, called jsonb_ops, which supports all these operators, and an index using jsonb_path_ops, which only supports @>. The default index creates an index item for every key and value in the JSON, while jsonb_path_ops only creates a hash of the keys leading up to a value and the value itself; that's a lot more compact and faster to process than the more complex default. But the default does offer more operations, at the cost of consuming more space. After adding some data to our table, we can do a select looking for a particular postcode. If we have no GIN index in place and do a query:

explain select * from justjsonb where doc @> '{ ""address"": { ""postcode"":""HA36CC"" } }';
                            QUERY PLAN
------------------------------------------------------------------
 Seq Scan on justjsonb  (cost=0.00..3171.14 rows=100 width=123)
   Filter: (doc @> '{""address"": {""postcode"": ""HA36CC""}}'::jsonb)
(2 rows)

We can see that it will sequentially scan the table. Now, if we create a default JSON GIN index, we can see the difference it makes:

> create index justjsonb_gin on justjsonb using gin (doc);
> explain select * from justjsonb where doc @> '{ ""address"": { ""postcode"":""HA36CC"" } }';
                                  QUERY PLAN
------------------------------------------------------------------------------
 Bitmap Heap Scan on justjsonb  (cost=40.78..367.62 rows=100 width=123)
   Recheck Cond: (doc @> '{""address"": {""postcode"": ""HA36CC""}}'::jsonb)
   ->  Bitmap Index Scan on justjsonb_gin  (cost=0.00..40.75 rows=100 width=0)
         Index Cond: (doc @> '{""address"": {""postcode"": ""HA36CC""}}'::jsonb)
(4 rows)

It's a lot more efficient searching, as you can tell by the lower cost. But the hidden cost is in the size of the index. In this case it's 41% of the size of the data. Let's drop that index and repeat the process with a jsonb_path_ops GIN index:

> create index justjsonb_gin on justjsonb using gin (doc jsonb_path_ops);
> explain select * from justjsonb where doc @> '{ ""address"": { ""postcode"":""HA36CC"" } }';
                                  QUERY PLAN
------------------------------------------------------------------------------
 Bitmap Heap Scan on justjsonb  (cost=16.78..343.62 rows=100 width=123)
   Recheck Cond: (doc @> '{""address"": {""postcode"": ""HA36CC""}}'::jsonb)
   ->  Bitmap Index Scan on justjsonb_gin  (cost=0.00..16.75 rows=100 width=0)
         Index Cond: (doc @> '{""address"": {""postcode"": ""HA36CC""}}'::jsonb)
(4 rows)

The total cost is slightly lower, and typically the index size should be a lot smaller. It's going to be the classic task of balancing speed and size for indexes. But either way it's far more efficient than sequentially scanning. So, is PostgreSQL your next JSON database? If you update your JSON documents in place, the answer is no. What PostgreSQL is very good at is storing and retrieving JSON documents and their fields. But even though you can individually address the various fields within the JSON document, you can't update a single field. Well, actually you can, but only by extracting the entire JSON document, applying the new values and writing it back, letting the JSON parser sort out the duplicates. It's likely that you aren't going to want to rely on that. If your active data sits in the relational schema comfortably and the JSON content is a cohort to that data, then you should be fine with PostgreSQL and its much more efficient JSONB representation and indexing capabilities.
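To make that read-modify-write workaround concrete, here is a minimal sketch from application code. It assumes the node-postgres ('pg') module and the justjsonb table above, and is only meant to illustrate the round trip, not to be a recommended pattern:

const { Client } = require('pg');

async function changePostcode(id, newPostcode) {
  const client = new Client();   // connection details come from the usual PG* environment variables
  await client.connect();

  // 1. Pull the whole document out of the table...
  const { rows } = await client.query('SELECT doc FROM justjsonb WHERE id = $1', [id]);
  const doc = rows[0].doc;       // jsonb columns come back as parsed JavaScript objects

  // 2. ...change the one field we care about in application code...
  doc.address.postcode = newPostcode;

  // 3. ...and write the whole document back again.
  await client.query('UPDATE justjsonb SET doc = $1 WHERE id = $2', [doc, id]);
  await client.end();
}

The whole document travels over the wire twice, which is exactly why you may not want to rely on this for documents that change often.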
If, though, your data model is that of a collection of mutable documents, then you probably want to look at a database engineered primarily around JSON documents, like MongoDB or RethinkDB.","With the most recent version of PostgreSQL gaining ever more JSON capabilities, we've been asked if PostgreSQL could replace MongoDB as a JSON database.",Is PostgreSQL Your Next JSON Database?,Live,178 485,"POWER PROTOTYPING WITH MONGODB AND NODE-RED Published Nov 23, 2016. Do you want to be able to quickly get your database backend fronted by a web service? Node-RED and MongoDB can be a powerful ally in your strategy, and we'll show you how. Whether you're just getting started with a small toy project or about to embark on ""The Next Big Thing""®, being able to quickly set up a backend to your application can be the key to getting your project off the ground. With the widespread use of JSON as the de facto serialization format for data structures on the web, JSON document databases such as MongoDB are excellent platforms for prototyping new applications. However, exposing MongoDB directly to the client side of your application is difficult to manage, hard to keep efficient, and pushes logic into your client applications. In this article, we'll use Node-RED and MongoDB to build a minimal RESTful API for a photographer's portfolio website.
GETTING STARTED
ACCESSING A MONGODB INSTANCE
You should first spin up a Compose MongoDB database with SSL enabled or start your own local instance of MongoDB. Starting a new deployment on Compose is the easiest way to get started.
INSTALLING NODE-RED
You should also have access to a running installation of Node-RED. If you already have NodeJS on your local machine, you can use the Node Package Manager from a terminal to install Node-RED: npm install node-red. If you don't have NodeJS yet, you can download the installer for your platform directly from the NodeJS website.
INSTALL THE MONGODB2 NODE
Node-RED is a flow-based programming (FBP) environment, so connecting to services requires access to a ""node"" that provides the services you need. In this article, we'll connect to MongoDB using the node-red-node-mongodb2 package. You can install it by selecting the Manage Palette option from the main menu, searching for the mongodb2 node and clicking install.
CONNECTING TO MONGODB USING NODE-RED
To connect to MongoDB, you must first configure the mongodb2 node. Start by dragging the mongodb2 node onto the Node-RED canvas. The node initially has a red error indicator letting you know that the node needs to be configured. Double-click on the node to open the configuration editor. In the top section labelled ""server"", ensure that ""Add new mongodb2..."" is showing in the drop-down menu and click on the ""pencil"" icon to add a new server configuration. The configuration section has all of the information Node-RED needs to connect to MongoDB. You can find the connection string in your Compose console by clicking on your database name and clicking the Admin tab. Once you've configured a MongoDB node, it will be available for every subsequent MongoDB node you create by selecting it from the drop-down menu.
RETRIEVING AND QUERYING RECORDS (HTTP GET)
We're now ready to start adding the HTTP endpoints to our RESTful portfolio API. All of our endpoints will use the same route location but different HTTP methods to GET, POST, PUT, and DELETE items in the portfolio. We'll start with the GET endpoint.
To add an HTTP endpoint, we'll use the HTTP input node which comes installed by default in all new instances of Node-RED. In the future we can also add other interfaces such as WebSockets and RabbitMQ, but for now we'll stick with HTTP. To add a new HTTP endpoint, drag the HTTP input node onto the canvas. Double-click the node to open the configuration panel and add the URL and method you prefer (in this case, we'll start with the GET HTTP method). Since we're working with a data entity called ""project"", we'll make each of our endpoints available at the /projects URL. We'll also store each of these projects in a database collection called ""projects"". Double-click on the mongodb2 node and type projects in the collection field. Then select the find.toArray operation from the operation drop-down and click done . Next we'll wire together our mongodb2 node and HTTP input. This can be done by clicking on the out port on the right of the HTTP input node and dragging a wire to the in port on the left of the mongodb2 node as shown below. We'll also need to drag an HTTP output node onto the canvas to ensure that our HTTP client receives a response. Finally, click deploy to publish the flow and make the endpoints active. If you're running Node-RED locally, You can now access your endpoints at http://localhost/projects . Any query string parameters you pass to the URL will be available in the msg.payload object. Since the find method in the mongodb2 node also reads in the msg.payload object, we can use this wiring to automatically send all parameters passed into the HTTP input node to the mongodb2 node. For example, to search for a project with a title field matching “test” in the MongoDB shell, the following MongoDB query would look like: db.projects.find({ ""title"": ""test"" }); The query also could be executed using the following CURL command: curl -X GET localhost/projects?title=test CREATING THE REMAINING RESTFUL ENDPOINTS The other endpoints follow a similar structure: they start with an HTTP input node, send parameters into the mongodb2 node configured with the desired operation, and send the results as a response back to the user. In this next section, we'll cover the Create , Update , and Delete operations. CREATE A NEW PROJECT (HTTP POST) To create a new project in MongoDB, we'll send an HTTP POST request to the /projects endpoint. This is similar to what we did above with the GET request, except we have to extract project data from the POST request's body rather than the query string parameters. The HTTP input node does not send form body data in the msg.payload object, however we can access form body data directly using the msg.req object. The msg.req object contains the underlying HTTP request from ExpressJS , so POST form data can be found in the msg.req.body object. We can copy the msg.req.body over to the msg.payload by adding a function node in our flows. We'll start by copying the GET flow and pasting the copy onto the canvas. Then we'll modify the HTTP input node to use the POST method instead of the GET method. Now we’ll drag a function node onto the canvas and place it between mongodb2 and the HTTP input node. Then, we'll add the following code to the function node: msg.payload = msg.req.body; return msg; Finally, configure mongodb2 to use the projects collection and the insert operation and click Deploy . 
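If you want a little guarding against malformed submissions, the function node is a reasonable place for it. The following is only a sketch; it assumes the function node is configured with two outputs, with the second output wired directly to the HTTP response node so invalid requests skip MongoDB entirely:

// Output 1 goes on to the mongodb2 insert, output 2 goes straight back to the client.
msg.payload = msg.req.body;
if (!msg.payload || !msg.payload.title) {
    msg.statusCode = 400;                      // the HTTP response node returns this status code
    msg.payload = { error: 'title is required' };
    return [null, msg];                        // send only to the HTTP response output
}
return [msg, null];                            // valid: send only to the mongodb2 output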
You can use the following CURL command to create a new project with a title field and a value of ""test"": $ curl -X POST -H ""Content-Type: application/json"" -d '{""title"":""test""}' localhost/projects { ""_id"": ObjectID(""2fae32498ac2b113ca241543bfcaef""), ""title"": ""test""} UPDATE AN EXISTING DOCUMENT (HTTP PUT) Continuing on with RESTful convention, we'll use the HTTP PUT method to update an existing project. Let's copy the nodes we created for the POST method and modify the HTTP node to use PUT and change the mongodb2 operation to update . Since the document to update will be passed in through the request body, we'll keep the copied function node as it is. The PUT command includes the ID in an object in the body of the request, along with the fields that you want to update. The following CURL command will update an existing project: $ curl -X PUT -H ""Content-Type: application/json"" -d '{""_id"": ""2fae32498ac2b113ca241543bfcaef"", ""title"":""not test""}' localhost/projects { ""_id"": ObjectID(""2fae32498ac2b113ca241543bfcaef""), ""title"": ""not test""} DELETING AN EXISTING DOCUMENT (HTTP DELETE) The last of the CRUD operations we need to implement is DELETE . We'll use the HTTP DELETE method to do this. HTTP DELETE is similar to HTTP GET in that it does not send a form body along with the request. To send the ID of the record to be deleted we'll encode it in the URL like this: /projects/id_to_delete . Copy the flow we created for PUT in the previous section and paste the copy onto the canvas. Then, modify the Method field in the HTTP input node by clicking on the node and selecting the DELETE method from the drop-down menu. In the URL field we can insert /project/:id which adds the URL parameter id . We'll exploit the msg.req object again, this time to get the URL parameters from the msg.req.params object. To do this, we’ll add a function node to move the msg.req.params object over to the msg.payload by double-clicking on the function node and adding the following code to the editor: msg.payload = msg.req.params; return msg; Finally, we’ll update the mongodb2 node's operation field to deleteOne . You can delete an item with an ID of ""2fae32498ac2b113ca241543bfcaef"" by using the following CURL command: $ curl -X DELETE localhost/projects/2fae32498ac2b113ca241543bfcaef { ""_id"": ObjectID(""2fae32498ac2b113ca241543bfcaef"")} WRAP-UP MongoDB, with its schema-less architecture, and Node-RED, with its flow-based programming model, make a powerful rapid-prototyping duo. Node-RED also makes it possible to expand out the functionality of our minimal API as much as we want, thanks to its flexible programming model and robust community of third-party nodes. In the next installment, we'll move your API out of the prototype phase by adding authentication to your exposed endpoints using JSON Web Token . --------------------------------------------------------------------------------","In this article, we'll use Node-RED and MongoDB to build a minimal RESTful API for a photographer's portfolio website. ",Power Prototyping with MongoDB and Node-RED,Live,179 486,"OFFLINE CAMP CALIFORNIA -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Maureen McElaney 11/17/16Maureen McElaney Maureen McElaney is a Developer Advocate at IBM Cloud Data Services. 
Prior to joining the team, she worked as a QA Engineer at Dealer.com and is passionate about building tools that increase developer productivity and joy. She is an experienced community builder. In 2013 she founded the Burlington, Vermont chapter… Learn More Recent Posts * Offline Camp California A recap of our second-ever Offline Camp, and how to get involved in the Offline… * Girl Develop It Summit Recap IBM Cloud Data services proudly sponsored the 2016 Girl Develop It Leadership Summit. One of… * IBM Went Camping With #OfflineFirst In June 2016, members of the Offline First community gathered for a retreat at a… CAMARADERIE, OFFLINE FIRST & TARANTULAS Offline Camp is a gathering of folks from the Offline First community, who come together to share projects, best practices, and hack on offline-first problems over a long weekend away from it all. The Offline First movement involves people who build Progressive Web Apps , native apps, desktop apps, IoT, and even data scientists! After co-organizing the first Offline Camp in New York this past June, I was excited to attend the California event as a simple camper. Hayride brought us some very good views @OfflineCamp pic.twitter.com/hzdMEFhlqR — Steve Trevathan (@strevat) November 6, 2016 Two of Offline Camp’s co-organizers, Bradley Holt and Gregor Martynus , held a session on the future of the Offline First community where they took a first crack at a community logo, which was met with more ¯\_(ツ)_/¯ s. You can see a full list of topics that arose at camp on the Offline Camp medium account . WiFi ? LiFi ? ¯_☁️_/¯ @OfflineCamp #OfflineFirst pic.twitter.com/A5bhlGMdcT — Luis Montes (@monteslu) November 7, 2016 THE CAMPERS The campers are what truly sets Offline Camp apart from your run-of-the-mill tech conference. All the sessions are proposed, voted upon, and decided by the people in attendance. The organizing team continuously works hard to connect with a diverse audience of people who are doing amazing things in the Offline First space. The organizers promote an inclusive environment at camp via the camp Code of Conduct . These rules are important because the campers are the ones who decide what happens at camp. The people who attend truly set the tone for what kind of projects the Offline First community will tackle afterward! This installment of Offline Camp occurred in Santa Margarita, California There were 21 amazing people at camp, but I thought I’d highlight a few so that you could get a feel for the type of people who attend an event like this. MAX I met cat cafe connoisseur Max Ogden , former Fellow at Code for America and author of JS for Cats, who also maintains Dat Data Project , which shares datasets over peer to peer networks. Had such an awesome time at @OfflineCamp , made new lifelong friends, got excited about the future of the web. And saw this bunny pic.twitter.com/FUzOtGPCPB — maxwell ogden (@denormalize) November 7, 2016 MACHIKO At camp, you could have gone on an Offline First mapping hike with Machiko Yasuda , who runs multiple tech meetups in Los Angeles including Fullstack and MaptimeLA . She also told me about an open source tool she built for mentoring new developers and leveling up existing ones, called exercism.io . on the drive home from @offlinecamp & reflecting on what i learned this weekend, i got stuck on the 10 behind a car with this plate: READ ME — machiko / 安田万智子 (@machikoyasuda) November 8, 2016 NOLAN Around the campfire, we played the role-playing game Werewolf . 
There I sat across from Nolan Lawson , who maintains PouchDB and works at Microsoft Edge . In the game he poisoned me because he thought I was the werewolf — alas, I was but an innocent villager! — but in real life I learned a bunch about the development environment for Microsoft users and made plans for our Offline First panel at SXSW . By the way, we both hope that if you’re planning to attend SXSW next year, that you’ll come see our panel! Learn more about it here . Day 2 Design Patterns sesh at @OfflineCamp . ""What percentage of PWAs were written by @jaffathecake ?"" @nolanlawson #offlinefirst pic.twitter.com/ysVDMpL22H — Mo_Mack (@Mo_Mack) November 6, 2016 TRAIL’S END It was hard to leave all the friends we made at camp. Campers have a lot of amazing things to say about their experience: I just published “My biggest takeaway from the second Offline Camp in Santa Margarita, CA” https://t.co/mjfHPExh6C — Disruption disruptor (@jessebeach) November 8, 2016 My jacket smells like a campfire and I'm instantly reminded how awesome @OfflineCamp was! — John Kleinschmidt (@jkleinsc) November 10, 2016 Follow the Offline Camp Medium account now, as the majority of the campers have signed up to contribute recap posts from the sessions they participated in. Those articles will be published continuously in the coming weeks. Sign up for the Offline First Reader to stay on top of news and events happening within the community. If you’re interested in contributing now, join the Offline First Slack team and add to the discussion. Stay tuned for more Offline Camp events in 2017 — perhaps we’ll even go to Europe? ¯\_(ツ)_/¯ Better question: Are there tarantulas in Europe? Met some (actually quite friendly) neighbors this morning. pic.twitter.com/Ia3iYG0W8M — Offline Camp (@OfflineCamp) November 7, 2016 ","Offline Camp is a gathering of the Offline First community, coming together to hack on offline first problems over a long weekend away from it all.",Offline First at Offline Camp California,Live,180 493,"Homepage PUBLISHED IN AUTONOMOUS AGENTS — #AI Follow Sign in / Sign up 33 Preetham V V Blocked Unblock Follow Following #AI & #MachineLearning enthusiast. Author: Java Web Services / Internet Security & Firewalls. VP, Brand Sciences & Products @inMobi #UltraRunner 3 days ago 11 min read -------------------------------------------------------------------------------- BAYESIAN REGULARIZATION FOR #NEURALNETWORKS image creditIf you are a Science or Math nerd, there is no way in hell you would have not heard of Bayes’s Theorem . It’s pervasive and quite a powerful inference model to understand and model anything from growth of Cancer cells, to obstacle detection in Autonomous Robots, to fixing the probability of a collision course of a Asteroid towards Earth. The simplicity of the Model is where it draws its power from. Specifically in the Artificial Intelligence community, you cannot do away with Bayesian Inference and Reasoning for optimizing your models. In the past post titled ‘ Emergence of the Artificial Neural Network ” I had mentioned that ANNs are emerging prominently among all other models due to its ability to accommodate techniques and theories from all other AI approaches quite well. I did mention that a full Bayesian Model can be used for interpreting weight decay. In this post, I intend to showcase the Bayesian techniques for Regularizing Neural Networks. This concept is also called Bayesian Regularized Artificial Neural Networks or BRANN for short. 
--------------------------------------------------------------------------------
WHAT IS BAYES'S THEOREM?
(Feel free to skip this section if you already understand Bayes's Theorem.) Bayes's Theorem is fundamentally based on the concept of the “validity of Beliefs”. Reverend Thomas Bayes was a Presbyterian minister and a Mathematician who pondered much about developing a proof of the existence of God. He came up with the Theorem in the 18th century (it was later refined by Pierre-Simon Laplace) to fix or establish the validity of 'existing' or 'previous' Beliefs in the face of the best available 'new' evidence. Think of it as an equation to correct prior beliefs based on new evidence. One of the popular examples used to explain Bayes's Theorem is detecting whether a patient has a certain disease or not. The key concepts in the Theorem are as follows:
Event: An event is a fact. The patient truly having the disease is an event. Truly NOT having the disease is also an event.
Test: A test is a mechanism to detect whether a patient has the disease (or a test devised to prove that a patient does not have the disease; note that they are not the same tests).
Subject: A patient is a subject who may or may not have the disease. A test needs to be devised for the subject to detect the presence of the disease, or a test devised to prove that the disease does not exist.
Test Reliability: A test devised to detect the disease may not be 100% reliable; it may not detect the disease every time. When the test fails to recognize the disease in a subject who truly has the disease, we call that a false negative. When the test on a subject who truly does not have the disease shows that the subject does have it, we call that a false positive.
Test Probability: This is the probability of a test detecting the event (disease) given a subject (patient). This does not account for the Test Reliability.
Event Probability (Posterior Probability): This is the “corrected” test probability of detecting the event given a subject, obtained by considering the reliability of the devised test.
Belief (Prior Probability): A belief, also called a prior probability (or prior, in short), is the subjective assumption that the disease exists in a patient (based on symptoms or other subjective observations) prior to conducting the test. This is the most important concept in Bayes's Theorem. You need to start with the priors (or Beliefs) before you make corrections to that belief.
The following is the equation which accommodates the stated concepts:
P(Ai|B) = P(B|Ai) × P(Ai) / [ P(B|A1) × P(A1) + P(B|A2) × P(A2) ]
In the equation,
 * A1, A2, ... are the events. A1 and A2 are mutually exclusive and collectively exhaustive. Let A1 mean that the disease is present in the subject and A2 mean that the disease is absent.
 * Let Ai refer to either one of the events A1 or A2.
 * B is a test devised to detect the disease (alternatively, it can also be a test that is devised to prove that the disease does not exist in the subject; again, note that these are two completely different tests).
 * Let us say there is a population of people (in a random city) where there is a prior belief (based on some observation, which may or may not be subjective) that 5% of the population “has the disease”. So, for any given subject in the population, the prior probability P(A1), “has the disease”, is 5% and the prior probability P(A2), “does not have the disease”, is 95%.
* Let's say the test 'B', which is devised to “detect” the presence of the disease, has a reliability of 90% (in other words, it detects the presence of the disease in a patient who truly has the disease in only 9 out of 10 tests). Written mathematically, the probability of the test detecting the disease when the disease is truly present is P(B|A1) = 0.9.
 * Unfortunately, the test 'B' also has a flaw which sometimes shows that the patient has the disease even when the disease is truly not present. Let us say that 2 out of 10 patients who really do not have the disease get falsely detected as having it. Mathematically, P(B|A2) = 0.2.
 * Now, if you randomly select a subject from the population and conduct the test on that subject, AND the test result shows positive (the patient does have the disease), can we calculate the “event probability” (or the Posterior Probability) of the person truly having the disease?
 * Mathematically: calculate P(A1|B), which can be read as the probability of A1 (presence of disease) given B (the test result being positive).
So let's assign the values for each probability:
 * Prior Probability of the person having the disease = P(A1) = 0.05
 * Prior Probability of the person NOT having the disease = P(A2) = 0.95
 * Conditional Probability that the test shows positive, given that the person truly does have the disease = P(B|A1) = 0.9
 * Conditional Probability that the test shows positive, even if the person truly does NOT have the disease = P(B|A2) = 0.2
 * What is the “event probability” that a randomly selected person from the population, on whom the test was performed and whose test result shows positive, truly has the disease? That is, what is P(truly has the disease given the test is positive) = P(A1|B)?
The posterior probability can be calculated based on Bayes's Theorem as follows:
P(A1|B) = P(B|A1) × P(A1) / [ P(B|A1) × P(A1) + P(B|A2) × P(A2) ] = (0.9 × 0.05) / (0.9 × 0.05 + 0.2 × 0.95) = 0.045 / 0.235 ≈ 0.19
So the posterior probability of the person truly having the disease, given that the test result is positive, is only 19%!! Note the stark difference in the corrected probability even though the test is 90% accurate. Why do you think this is the case? The answer lies in the 'priors'. Note that the “belief” that only 5% of the population may have the disease is the strong reason for a 19% posterior probability. It's easy to prove. Change your prior belief (all else being equal) from 5% to, let's say, 30%. Then you get the following result:
P(A1|B) = (0.9 × 0.30) / (0.9 × 0.30 + 0.2 × 0.70) = 0.27 / 0.41 ≈ 0.66
Note that the posterior probability for the same test with a higher prior jumped significantly, to about 66%. Hence, all evidence and tests being equal, Bayes's Theorem is strongly influenced by priors. If you start with a very low prior, even in the face of strong evidence the posterior probability will be closer to the prior (lower). A prior is not something you randomly make up. It should be based on observations, even if subjective. There should be some emphasis on why someone holds on to a belief before assigning a percentage. If you believe that God does not exist (your prior), then a strong test/evidence/hypothesis which positively detects the possible existence of God moves your prior belief only a little bit, no matter how accurate the tests are.
--------------------------------------------------------------------------------
WHAT DOES BAYESIAN INFERENCE MEAN FOR NEURAL NETS?
Now that we understand Bayes's Theorem, let's see how it is applicable to regularizing Neural Networks. In the past few posts, we learnt about how Neural Nets overfit data and also techniques to regularize the Network towards reducing bias and variance.
(A high-variance state is a state in which the network is overfitted.) One of the techniques to reduce variance and improve generalization is to apply weight decay and weight constraints. If we manage to trim the growing weights on a Neural Network to some meaningful degree, then we can control the variance of the network and avoid overfitting. So let's focus on the probability distribution of the weight vector given a set of training data. First, let's look again at what happens in a Neural Network.
 * We initialize the weight vector of a Neural Network to some optimal initial state.
 * We have a set of training data that will be run through the network continuously, which shall change the weight vector to meet a stated output during training.
 * Every time we start with a new input (from the training data set) to train, we have a prior distribution of the weight vector and a probability of an output for the given input based on the weight vector.
 * Based on the new output, a cost function calculates the error deviations.
 * Back-propagation is used to fix the prior weights to reduce error.
 * We see a posterior distribution of the weight vector for the given training data.
The question we ask here is twofold:
 1. Can we use Bayesian Inference in such a way that the weight distribution is made optimal to learn the correct function that relevantly maps the input to the output?
 2. Can we ensure that the network is NOT overfitting?
To recap, mathematically, if 't' is an expected target output and 'y' is the output of the Neural Net, then the local error is nothing but E = (t − y). The global error, meanwhile, can be a mean squared error (MSE):
MSE = (1/N) × Σ_c (t_c − y_c)²
or an error sum of squares (ESS):
ESS = Σ_c (t_c − y_c)²
 * Note that the dominant part of each equation is the squared error.
 * We are trying to find the weight vector that minimizes the squared errors.
 * In likelihood terms, we can also state that we want to find the weight vector that maximizes the log probability density of a correct answer.
 * Minimizing the squared error is the same as maximizing the log probability density of the correct answer. This is called Maximum Likelihood Estimation.
MAXIMUM LIKELIHOOD LEARNING
First, let us look at Maximum Likelihood learning before we apply Bayesian Inference. To do so, let's assume that we are applying Gaussian Noise to the output of the Neural Network to regularize the network. In the previous post titled “Mathematical foundation for Noise, Bias and Variance”, we used Noise as a regularizer on the input. Note that we can apply Noise even to the output. Again, mathematically:
y_c = f(x_c, w)
In other words, let the output for a given training case, y_c, be some function of an input x_c and the weight vector w. Now, assuming that we are applying Gaussian Noise to the output, we get:
p(t_c | y_c) = 1/√(2πσ²) × exp( −(t_c − y_c)² / 2σ² )
We are simply stating that the probability density of the target value, given the output after applying Gaussian Noise, is the Gaussian distribution centered around the output. Let's use the negative log probability as the cost function, as we want to minimize the cost. So we get:
−log p(t_c | y_c) = (t_c − y_c)² / 2σ² + ½ log(2πσ²)
When we are working on multiple training cases 'c' in the dataset 'D', we intend to maximize the product of the probabilities of the output of every training case 'c' in the dataset 'D' being close to the target. Since the output error for every training case is NOT dependent on the previous training case, we can mathematically state this as:
P(D | w) = Π_c p(t_c | f(x_c, w))
In other words, the probability of observed data given a weight vector 'w' is the product of all probabilities of training case given the output.
(Note that the output y_c is a function of the inputs x_c and the weight vector 'w'.) But instead of the product of the probabilities of the target values given the outputs, we stated that we can work in the log domain by taking negative log probabilities. So we can instead work on maximizing the sum of log probabilities, as shown:
log P(D | w) = Σ_c log p(t_c | x_c, w) = −Σ_c (t_c − y_c)² / 2σ² + constant
The above is the log probability of the observed data given a weight vector, and maximizing it maximizes the log probability density of the output being close to the target value (assuming we are adding Gaussian noise to the output).
BAYESIAN INFERENCE AND MAXIMUM A POSTERIORI (MAP)
We worked out an equation for Maximum Likelihood learning, but can we use Bayesian Inference to regularize the Maximum Likelihood? Indeed, the solution seems to lie in applying Maximum A Posteriori, or MAP for short. MAP tries to find the mode of the posterior distribution by employing Bayes's Theorem. For Neural Networks, this can be written as:
P(w | D) = P(D | w) × P(w) / P(D)
Where,
 * P(w|D) is the posterior probability of the weight vector 'w' given the training data set D.
 * P(w) is the prior probability of the weight vector.
 * P(D|w) is the probability of the observed data given weight vector 'w'.
 * And the denominator, P(D), is an integral over all possible weight vectors.
We can convert the above equation to a cost function by again applying the negative log likelihood, as follows:
Cost = −log P(w | D) = −log P(D | w) − log P(w) + log P(D)
Here,
 * P(D) is an integral over all possible weights, and hence log P(D) converts to some constant.
 * From Maximum Likelihood, we already learnt the equation for log P(D | w).
Let's look at log P(w), which is the log probability of the prior weights. This is based on how we initialize the weights. In the post titled “Is Optimizing your Neural Network a Dark Art?” we learnt that the best way to initialize the weights is to apply a zero-mean Gaussian. So, mathematically:
−log P(w) = Σ_i w_i² / 2σ_w² + constant
So, the Bayesian Inference for MAP is as follows:
Cost = 1/2σ² × Σ_c (t_c − y_c)² + 1/2σ_w² × Σ_i w_i² + constant
Again, notice the similarity of the loss function to L2 regularization. Also note that we started with a randomly initialized zero-mean-Gaussian weight vector for MAP and then started working towards fixing it to improve P(w|D). This has the same side effect as L2 regularizers, which can get stuck in local minima. We take the MAP approach because a full Bayesian approach over all possible weights is computationally intensive and is not tractable. There are tricks with MCMC which can help approximate an unbiased sample from the true posterior over the entire weights. I may cover this later in another post. Maybe now, you are equipped to validate the belief in God…","If you are a Science or Math nerd, there is no way in hell you would have not heard of Bayes’s Theorem. It’s pervasive…",Bayesian Regularization for #NeuralNetworks – Autonomous Agents — #AI,Live,181 494,"
DATA SCIENCE EXPERIENCE: OVERVIEW OF RSTUDIO IDE
developerWorks TV
Published on Oct 3, 2017
Find more videos in the Data Science Experience Learning Center at http://ibm.biz/dsx-learning
CATEGORY: Science & Technology
LICENSE: Standard YouTube License
",This video is a quick tour of the RStudio Integrated Development Environment inside IBM Data Science Experience (DSX). ,Overview of RStudio IDE in DSX,Live,182 496,"DATA SCIENCE EXPERT INTERVIEW: HOLDEN KARAU
July 28, 2016 | 6:20
OVERVIEW
James Kobielus, data science evangelist at IBM, interviews Holden Karau, principal software engineer of big data at IBM and coauthor of Learning Spark. To ensure data science success, you need to provide data scientists with an environment that is open, engaging and collaborative. To explore how your data scientists can access all the open functionality and expertise they’ll need for critical projects, join the new Data Science Experience. To learn how the next generation of open analytics will boost data scientist productivity, click here to register for IBM DataFirst launch event taking place on Tuesday September 27 in New York, or, if you can’t make it in person, click here to register for the livestream to the event.
Topics: Analytics, Big Data Technology, Big Data Use Cases, Data Scientists, Hadoop
Tags: data science, Spark, R, Hadoop, predictive analytics
","James Kobielus, data science evangelist at IBM, interviews Holden Karau, principal software engineer of big data at IBM and coauthor of Learning Spark.",Data science expert interview: Holden Karau,Live,183 503,"
IMPROVING THE ROI OF BIG DATA AND ANALYTICS THROUGH LEVERAGING NEW SOURCES OF DATA Posted on April 21, 2017. Contributed by: Bart Baesens. This article first appeared in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to receive our feature articles, or follow us @DataMiningApps. Do you also wish to contribute to Data Science Briefings? Shoot us an e-mail over at briefings@dataminingapps.com and let's get in touch!
Big Data and Analytics are all around these days. Most companies already have their first analytical models in production and are thinking about further boosting their performance. Far too often, they focus on the analytical techniques rather than on the key ingredient: data! We believe the best way to boost the performance and ROI of an analytical model is by investing in new sources of data which can help to further unravel complex customer behavior and improve key analytical insights. In what follows, we briefly explore various types of data sources that could be worthwhile pursuing in order to squeeze more economic value out of your analytical models.
A first option concerns the exploration of network data by carefully studying relationships between customers. These relationships can be explicit or implicit. Examples of explicit networks are calls between customers, shared board members between firms, and social connections (e.g., family, friends). Explicit networks can be readily distilled from underlying data sources (e.g., call logs) and their key characteristics can then be summarized using featurization procedures, resulting in new characteristics which can be added to the modeling data set. In our previous research (Verbeke et al., 2014; Van Vlasselaer et al., 2017), we found network data to be highly predictive for both customer churn prediction and fraud detection.
Implicit networks, or pseudo networks, are a lot more challenging to define and featurize. Martens and Provost (2016) built a network of customers where links were defined based upon which customers transferred money to the same entities (e.g., retailers), using data from a major bank. When combined with non-network data, this innovative way of defining a network based upon similarity instead of explicit social connections gave a better lift and generated more profit for almost any targeting budget. In another, award-winning study they built a geosimilarity network among users based upon location-visitation data in a mobile environment (Provost et al., 2015). More specifically, two devices are considered similar, and thus connected, when they share at least one visited location. They are more similar if they have more shared locations and as these are visited by fewer people. This implicit network can then be leveraged to target advertisements to the same user on different devices or to users with similar tastes, or to improve online interactions by selecting users with similar tastes. Both of these examples clearly illustrate the potential of implicit networks as an important data source.
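To make the geosimilarity idea a little more concrete, here is a minimal sketch (not from the cited study) of how such an implicit network could be built and featurized in Python. It assumes the networkx library; the toy visits table and variable names are invented, and unlike the original research it simply counts shared locations rather than down-weighting popular ones.
import networkx as nx

# toy location-visitation data: device -> set of visited locations (invented)
visits = {
    'device_a': {'loc1', 'loc2'},
    'device_b': {'loc2', 'loc3'},
    'device_c': {'loc4'},
}

G = nx.Graph()
G.add_nodes_from(visits)

# connect two devices when they share at least one visited location;
# store the number of shared locations as a simple edge weight
for d1 in visits:
    for d2 in visits:
        if d1 < d2:
            shared = visits[d1] & visits[d2]
            if shared:
                G.add_edge(d1, d2, weight=len(shared))

# featurization: the degree of each node becomes a new modeling characteristic
degree_feature = dict(G.degree())
print(degree_feature)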
A key challenge here is to creatively think about how to define these networks based upon the goal of the analysis. Data are often branded as the new oil. Hence, data pooling firms capitalize on this by gathering various types of data, analyzing them in innovative and creative ways, and selling the results thereof. Popular examples are Equifax, Experian, Moody’s, S&P, Nielsen, and Dun & Bradstreet, among many others. These firms consolidate publically available data, data scraped from websites or social media, survey data, and data contributed by other firms. By doing so, they can perform all kinds of aggregated analyses (e.g., geographical distribution of credit default rates in a country, average churn rates across industry sectors), build generic scores (e.g., the FICO in the US) and sell these to interested parties. Because of the low-entry barrier in terms of investment, externally purchased analytical models are sometimes adopted by smaller firms (e.g., SMEs) to take their first steps in analytics. Besides commercially available external data, open data can also be a valuable source of external information. Examples are industry and government data, weather data, news data, and search data (e.g., Google Trends). Both commercial and open external data can significantly boost the performance and thus economic return of an analytical model. Macro-economic data are another valuable source of information. Many analytical models are developed using a snapshot of data at a particular moment in time. This is obviously conditional on the external environment at that moment. Macro-economic up- or down-turns can have a significant impact on the performance and thus ROI of the analytical model. The state of the macro-economy can be summarized using measures such as gross domestic product (GDP), inflation and unemployment. Incorporating these effects will allow us to further improve the performance of analytical models and make them more robust against external influences. Textual data are also an interesting type of data to consider. Examples are product reviews, Facebook posts, Twitter tweets, book recommendations, complaints, and legislation. Textual data are difficult to process analytically since they are unstructured and cannot be directly represented into a matrix format. Moreover, these data depend upon the linguistic structure (e.g., type of language, relationship between words, negations, etc.) and are typically quite noisy data due to grammatical or spelling errors, synonyms and homographs. However, they can contain very relevant information for your analytical modeling exercise. Just as with network data (see above), it will be important to find ways to featurize text documents and combine it with your other structured data. A popular way of doing this is by using a document term matrix indicating what terms (similar to variables) appear and how frequently in which documents (similar to observations). It is clear that this matrix will be large and sparse. 
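As a hedged illustration of that featurization step (not taken from the article itself), the sketch below builds a small document term matrix with scikit-learn and reduces it to a handful of latent concepts, anticipating the clean-up and SVD steps discussed next. The example documents are invented and the variable names are only illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# invented example documents (e.g., short product reviews)
docs = [
    'great product, fast delivery',
    'the product broke after a week',
    'excellent item, would buy this product again',
]

# lowercasing and stop word removal mirror two of the clean-up activities below
vectorizer = CountVectorizer(lowercase=True, stop_words='english')
dtm = vectorizer.fit_transform(docs)    # sparse matrix: documents x terms

print(dtm.shape, 'non-zero entries:', dtm.nnz, 'terms:', len(vectorizer.vocabulary_))

# reduce the sparse matrix to 2 'latent concepts' that can be added as features
svd = TruncatedSVD(n_components=2)
concepts = svd.fit_transform(dtm)
print(concepts)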
Dimension reduction will thus be very important, as the following activities illustrate:
* represent every term in lower case (e.g., PRODUCT, Product, product become product)
* remove terms which are uninformative, such as stop words and articles (e.g., the product, a product, this product become product)
* use synonym lists to map synonym terms to one single term (product, item, article become product)
* stem all terms to their root (products, product become product)
* remove terms that only occur in a single document
Even after the above activities have been performed, the number of dimensions may still be too big for practical analysis. Singular Value Decomposition (SVD) offers a more advanced way to do dimension reduction (Meyer, 2000). SVD works similarly to principal component analysis (PCA) and summarizes the document term matrix into a set of singular vectors (also called latent concepts) which are linear combinations of the original terms. These reduced dimensions can then be added as new features to your existing, structured data set.
Besides textual data, other types of unstructured data such as audio, images, videos, fingerprint, GPS, and RFID data can be considered as well. To successfully leverage these types of data in your analytical models, it is of key importance to carefully think about creative ways of featurizing them. When doing so, it is recommended that any accompanying metadata are taken into account; for example, not only the image itself may be relevant, but also who took it, where, and at what time. This information could be very useful for fraud detection.
To summarize, we strongly believe that the best way to boost the performance and ROI of your analytical models is by investing in data first! In this contribution, we gave some examples of alternative data sources which can contain valuable information about the behavior of your customers.
REFERENCES
* Martens D., Provost F., Mining Massive Fine-Grained Behavior Data to Improve Predictive Analytics, MIS Quarterly, Volume 40, Number 4, pp. 869-888, 2016.
* Meyer C.D., Matrix Analysis and Applied Linear Algebra, SIAM, Philadelphia, 2000.
* Provost F., Martens D., Murray A., Finding Similar Mobile Consumers with a Privacy-Friendly Geosocial Design, Information Systems Research, Volume 26, Issue 2, pp. 243-265, 2015.
* Van Vlasselaer V., Eliassi-Rad T., Akoglu L., Snoeck M., Baesens B., GOTCHA! Network-based Fraud Detection for Security Fraud, Management Science, forthcoming, 2017.
* Verbeke W., Martens D., Baesens B., Social network analysis for customer churn prediction, Applied Soft Computing, Volume 14, pp. 341-446, 2014.
", We believe the best way to boost the performance and ROI of an analytical model is by investing in new sources of data which can help to further unravel complex customer behavior and improve key analytical insights.,Improving the ROI of Big Data and Analytics through Leveraging New Sources of Data,Live,184 508,"With Cloudant, building location-aware systems is within the reach of any web developer. This demo application uses HTML5 and JavaScript to record a device's GPS locations, and then save them—both on the device and to IBM Cloudant. To get started, sign up or sign in to Cloudant and then grab the code on Github. Coming Soon: Add a middle tier to manage users, with NodeJS.","With Cloudant, building location-aware systems is within the reach of any web developer. This demo application uses HTML5 and JavaScript to record a device's GPS locations, and then save them—both on the device and to IBM Cloudant.",Location Tracker,Live,185 514,"GEOSPATIAL QUERY WITH CLOUDANT SEARCH Raj R Singh / January 7, 2016
Geospatial querying is such a basic requirement for modern applications. Many apps are map-centric, like Yelp! or Hotels.com or retail store finders, which help users find places nearby. But other geospatial query use cases live deep under the covers of an app, like a ToDo list app that notifies you when you're near the place you can accomplish a task. This is a quick tutorial on how to use Cloudant Search to add geospatial query to your apps.
GEOSPATIAL QUERY OPTIONS IN CLOUDANT
First off, as a developer, you need to know that there are 2 different options for performing geospatial queries in Cloudant:
* Cloudant Geo offers the most flexible geospatial query options. You can query by radius, rectangle, and polygon, but you can't query by any other attributes of the database at the same time. (At least not today, but engineering elves are hard at work building this feature!)
* Cloudant Search only supports rectangle bounding box queries, but unlike Cloudant Geo, you can combine it with attribute and free text search. If you're searching for a doctor, seeing mechanics in search results gets in the way, so refining your geospatial search with additional attributes is a must in many cases. If your result set is small, it's easy to do that client-side, but if it gets big (for instance, if you're in a densely populated city) a simple geo index won't cut it, as you really want to include additional search requirements with your location data.
Cloudant Search is powered by Apache Lucene, the most popular open-source search library. By drawing on the speed and simplicity of Lucene, the Cloudant service provides a familiar way to add search to apps.
Cloudant Search lets you further enhance indexing and querying with:
* Ranked searching. Search results can be ordered by relevance or by custom sort fields.
* Powerful query types, including phrase queries, wildcard queries, proximity queries, fuzzy searches, range queries and more.
* Language-specific analyzers.
* Faceted search and filtering.
* Bookmarking. Paginate results in the style of popular Web search engines.
INDEXING BOSTON CRIME DATA FOR SEARCH
There's already a host of excellent resources on indexing and querying with Cloudant Search, so if you're not familiar with the basics, start here:
* Cloudant Learning Center: video on Search
* Formal API documentation on Cloudant Search
* Cloudant For Developers: Search Indexes
Once you're up-to-speed, we can have some fun with crime data! We'll use a sample of crimes in Boston, MA provided by the city government as open data here. We already have this data in Cloudant, and you can view a sample here, or replicate the database to your own Cloudant account. If you want to follow along while coding and don't already have a Cloudant account, sign up for a free trial here.
The first thing we need to do to the database is define our Search index. Here is the Javascript function for that:
function (doc) {
  if (doc.properties.main_crimecode && doc.geometry.coordinates[0] && doc.geometry.coordinates[1]) {
    index(""type"", doc.properties.main_crimecode, {""store"": true, ""facet"": true});
    index(""long"", doc.geometry.coordinates[0]);
    index(""lat"", doc.geometry.coordinates[1]);
  }
}
I save this to the crimes database in a design document called lucenegeoblog and name the index findcrimes (those 2 facts will be important next, when we write our queries). Note that I'm indexing 3 properties of the database, and indexing a document only if those properties exist.
* doc.properties.main_crimecode tells us what the crime was (or at least the main crime, since people could be doing more than one bad thing at the same time)
* doc.geometry.coordinates[0] is where the longitude value for the crime's location lives
* doc.geometry.coordinates[1] is where the latitude value for the crime's location lives
Now we're ready to play with the data…
QUERYING CRIMES
TERM SEARCH
Lucene offers a whole range of interesting ways to query text, including fuzzy matching, proximity search, numerical ranges, and more. Here, since the focus is on the geospatial aspects, we'll just do the most basic of text searches, barely flexing Lucene's muscles, but it's enough to illustrate the point.
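If you would rather run the queries that follow from code than from a browser, here is a minimal sketch (not part of the original post) that calls the same search endpoint with the Python requests library; the helper name find_crimes is mine, and the response fields used below (total_rows, rows, id, fields) are the ones shown in the results that follow.
import requests

SEARCH_URL = ('https://examples.cloudant.com/crimes/_design/'
              'lucenegeoblog/_search/findcrimes')

def find_crimes(query, **params):
    # Cloudant Search takes the Lucene query string in the q parameter
    params['q'] = query
    response = requests.get(SEARCH_URL, params=params)
    response.raise_for_status()
    return response.json()

result = find_crimes('type:Argue')
print(result['total_rows'])
for row in result['rows']:
    print(row['id'], row['fields'])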
Let’s justask for crimes involving an argument:https://examples.cloudant.com/crimes/_design/lucenegeoblog/_search/findcrimes?q=type:ArgueThis query returns 13 rows:{""total_rows"":13, ""bookmark"":""g1AAAAEWeJzLYWBgYMlgTmFQTElKzi9KdUhJMjTUy00tyixJTE_VS87JL01JzCvRy0styQEqZUpkSLL___9_VgaTmwNPqnMDUCzRFKRfAa7fEo_2JAcgmVQPM4H3rS3YBB00F5jgMSKPBUgyNAApoCn7wcYIioY-ABmjQYJHIMYcgBiD6h-jLADMN1fM"", ""rows"":[ {""id"":""79f14b64c57461584b152123e38a58ca"",""order"":[4.2708353996276855,0],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e38ec546"",""order"":[4.2708353996276855,40],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e38c4ce8"",""order"":[3.740839958190918,13],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e3908811"",""order"":[3.740839958190918,38],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e39108e1"",""order"":[3.740839958190918,44],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e38b11d4"",""order"":[3.549445152282715,8],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e38b5c12"",""order"":[3.549445152282715,10],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e38e2803"",""order"":[3.549445152282715,31],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e38e7cbf"",""order"":[3.549445152282715,39],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e3905861"",""order"":[3.549445152282715,44],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e390f947"",""order"":[3.549445152282715,50],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e390bc77"",""order"":[3.549445152282715,51],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e3912dab"",""order"":[3.549445152282715,53],""fields"":{""type"":""Argue""}} ]}Which would look like this if plotted on a map:Now, say we want to organize the results by proximity to a local bar we thinkmay be a problem. We know the coordinates of this bar, so we can use a clever sort parameter to accomplish our goal in this new query:https://examples.cloudant.com/crimes/_design/lucenegeoblog/_search/findcrimes?q=type:Argue&sort=""""This returns the same 13 rows, but take a look at the id s. 
The order is now different.{""total_rows"":13, ""bookmark"":""g1AAAAEmeJzLYWBgYMlgTmFQTElKzi9KdUhJMjTSy00tyixJTE_VS87JL01JzCvRy0styQEqZUpkSLL___9_Fpjj5iA578XsvIjgROMskBkKcDMs8BiR5AAkk-qRTOF5cLvueLNbIm8WmktM8BiTxwIkGRqAFNCk_TCjOM-udxWQPpzIgG4UPk9BjDoAMQruKsHexVECphqJOllZAFqPX6Q"", ""rows"":[ {""id"":""79f14b64c57461584b152123e3908811"",""order"":[0.0,38],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e38ec546"",""order"":[0.46176565188522095,40],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e38b5c12"",""order"":[0.9774288003583641,10],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e39108e1"",""order"":[1.399243473889131,44],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e38e2803"",""order"":[1.4297353780528468,31],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e3912dab"",""order"":[1.674393221777318,53],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e38b11d4"",""order"":[1.7185707796811796,8],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e390f947"",""order"":[2.1562546799337228,50],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e38a58ca"",""order"":[3.225431956819621,0],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e38c4ce8"",""order"":[3.6097936539303275,13],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e38e7cbf"",""order"":[3.7522872699576357,39],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e3905861"",""order"":[4.388318450202213,44],""fields"":{""type"":""Argue""}}, {""id"":""79f14b64c57461584b152123e390bc77"",""order"":[6.405184200868535,51],""fields"":{""type"":""Argue""}} ]}Now we can pay more attention to the crimes at the top of the list, and notwaste time looking at crimes far from the bar. This doesn’t seem like a big dealwith 13 results, but if we were using the full crime database, which has almosthalf a million crimes, optimizations like this are crucial.Another way to restrict our search to a small area around the bar would be toadd a geospatial bounding box (or rectangular ‘fence’) to the query, limitingresponses to documents whose longitude falls between -71.08 and -71.04 and whoselatitude falls between 42.28 and 42.32. Let’s also throw an include_docs=true parameter in the query so we can see all the information in the document.https://examples.cloudant.com/crimes/_design/lucenegeoblog/_search/findcrimes?q=type:Argue AND long:[-71.08 TO -71.04] AND lat:[42.28 TO 42.32]&sort=""""&include_docs=trueI won’t reproduce the entire response here, but it contains only 7 rows. Itworked!You’ve glimpsed the power of combining basic geospatial queries with Lucene’sextraordinary text search capabilities. The possibilities are truly endless.Comment here to let us know how you use it, and you could be a future guest starhere on our blog.SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Tagged: Apache Lucene / cloudant / geospatial Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. 
",A powerful way to fine-tune location search results. Combine basic geospatial queries with Lucene's extraordinary text search capabilities.,Geospatial query with Cloudant Search,Live,186 517,"COMPOSE AND RETHINKDB 2.3'S DRIVERS Apr 25, 2016
TL;DR: The latest RethinkDB drivers don't work with previous versions of RethinkDB. Take steps to ""pin"" your drivers to a compatible version.
RethinkDB recently released their latest excellent update to their database in the form of version 2.3, ""Fantasia"". There are quite a few improvements in the new version, such as user account support, a fold command, integrated SSL encryption and an official Windows version. You can read about them on RethinkDB's blog. We're working on incorporating support for those changes and releasing an updated RethinkDB.
What also happened was that RethinkDB took the opportunity to update the way clients and servers communicate, and that can lead to some problems for Node, Ruby, Python, Java and Go users. The issue here is about drivers. Using Node.js as an example of the issue, it'll present itself something like this in practice:
$ node index.js
ERROR: Received an unsupported protocol version. This port is for RethinkDB queries. Does your client driver version not match the server?
^
SyntaxError: Unexpected token E
    at Object.parse (native)
    at TLSSocket.handshake_callback (/Users/dj/sandbox/noderethink/node_modules/rethinkdb/net.js:624:35)
    at emitOne (events.js:90:13)
    at TLSSocket.emit (events.js:182:7)
    at readableAddChunk (_stream_readable.js:153:18)
    at TLSSocket.Readable.push (_stream_readable.js:111:10)
    at TLSWrap.onread (net.js:529:20)
WHAT'S BREAKING?
RethinkDB drivers use a wire protocol which lets clients talk to the server. You can read about it in Writing RethinkDB Drivers. The protocol got up to version 0.4 for RethinkDB 2.0 to 2.2, but for 2.3 there was a major change; specifically an updated protocol 1.0 which would be able to detect previous protocols and fall back to them. RethinkDB 2.3.0 servers support this new protocol, and previous protocol versions. One of the advantages of the new 1.0 protocol is that it can be updated much more easily, so that in future, when a protocol change is introduced, the clients and servers will know how to fall back to a version of the protocol they both speak.
The catch is that this once-in-a-development-lifecycle change also creates a break in compatibility. While older drivers are able to talk to the new RethinkDB 2.3 server, the new 2.3 drivers only speak version 1.0 of the protocol, which means that clients using the newest driver can't speak to previous versions of the server.
If this sounds like it shouldn't affect you, think again. Modern software development platforms use package managers to download the various components that applications need to run. They are npm for Node, gem for Ruby, pip for Python, Maven for Java, and Go has it baked into its platform.
It makes development much easier - to add the RethinkDB driver to a Node project all you have to do is run npm install rethinkdb --save and you are ready to go. Or not. By default, these package managers download the latest version of the package, which is sensible.
With the release of RethinkDB 2.3 though, all those repositories have had their RethinkDB drivers updated, so if you ask to install a driver without qualifying what version you want, you'll get the 2.3 version. Then, when you go to connect to a Compose RethinkDB installation you get the protocol incompatibility message:
ERROR: Received an unsupported protocol version. This port is for RethinkDB queries. Does your client driver version not match the server?
Of course, package managers do allow you to set the version, or range of versions, you want to go with your package. When you find yourself in this situation, you need to uninstall the latest driver, find out what the most recent previous driver you can download is – it'll be version 2.2.something – and install that.
THE SOLUTIONS...
NODE.JS
To install the correct driver to talk to Compose servers run:
npm uninstall rethinkdb
npm install rethinkdb@2.2.3
You can add a dependency in your package.json file so that it only uses a version prior to 2.3.0 like this:
{ ""dependencies"" : { ""rethinkdb"" : ""<2.3.0"" } }
RUBY
For Ruby, the quick way to set up the driver is to run:
gem uninstall rethinkdb
gem install rethinkdb -v 2.2.0.4
You can get a list of available versions by running gem list -ra rethinkdb. You can specify in your application's Gemfile that you want any driver up to but not including 2.3.0 and later by adding:
gem 'rethinkdb', '< 2.3.0'
PYTHON
For Python applications, you need to run:
pip uninstall rethinkdb
pip install rethinkdb==2.2.0.post6
If you want an idea of what versions are available in future, run pip install rethinkdb==noversion and pip will fail to find version ""noversion"" and list all the other available versions. If your Python program has a pip requirements file, add this to require a pre-2.3.0 driver:
rethinkdb >=2.2.0,<2.3.0
There is, we are told, an undocumented flag in the 2.3.0 driver which lets it talk to older RethinkDB databases, designed mainly for testing. It's probably best to ignore that though, as it will involve modifying your code to use an unsupported path, and why do that when you can just set versions as part of the build.
JAVA
There's no world of command line package management for Java; it tends to be all declared in build configuration files for the various tools. With Maven, as an example, you may have this in your pom.xml:
<dependency>
  <groupId>com.rethinkdb</groupId>
  <artifactId>rethinkdb-driver</artifactId>
  <version>LATEST</version>
</dependency>
This would pull the latest version of the driver down. It's not that common a setting – Java developers often set version numbers in their pom.xml files – but it is how you can unwittingly be caught by this protocol change. Simply replace the version tag with:
<version>2.2-beta-6</version>
This, of course, pins the application to that version and we're done.
GO
The previous drivers are all official drivers, but our favourite unofficial driver is GoRethink. If you use v1 of the driver you won't need to do anything as that just supports the 0.4 protocol.
If you upgrade – which means specifically importing the v2 package – then that does use the 1.0 protocol by default, but even there, as the CHANGELOG notes, you can set the HandshakeVersion to 0.4 when connecting to enable access to older servers.","TL;DR: The latest RethinkDB drivers don't work with previous versions of RethinkDB. Take steps to ""pin"" your drivers to a compatible version.",Compose and RethinkDB 2.3's drivers,Live,187 519,"","In the domain of data science, solving problems and answering questions through data analysis is standard practice. Often, data scientists construct a model to predict outcomes or discover underlying patterns, with the goal of gaining insights. Organizations can then use these insights to take actions that ideally improve future outcomes.",Foundational Methodology for Data Science,Live,188 522,"PRACTICAL BUSINESS PYTHON Taking care of business, one python script at a time. Sun 30 November 2014
COMMON EXCEL TASKS DEMONSTRATED IN PANDAS Posted by Chris Moffitt in articles
INTRODUCTION
The purpose of this article is to show some common Excel tasks and how you would execute similar tasks in pandas. Some of the examples are somewhat trivial but I think it is important to show the simple as well as the more complex functions you can find elsewhere. As an added bonus, I'm going to do some fuzzy string matching to show a little twist to the process and show how pandas can utilize the full python system of modules to do something simply in python that would be complex in Excel. Make sense? Let's get started.
ADDING A SUM TO A ROW
The first task I'll cover is summing some columns to add a total column. We will start by importing our excel data into a pandas dataframe.
import pandas as pd
import numpy as np
df = pd.read_excel(""excel-comp-data.xlsx"")
df.head()
account name street city state postal-code Jan Feb Mar
0 211829 Kerluke, Koepp and Hilpert 34456 Sean Highway New Jaycob Texas 28752 10000 62000 35000
1 320563 Walter-Trantow 1311 Alvis Tunnel Port Khadijah NorthCarolina 38365 95000 45000 35000
2 648336 Bashirian, Kunde and Price 62184 Schamberger Underpass Apt. 231 New Lilianland Iowa 76517 91000 120000 35000
3 109996 D’Amore, Gleichner and Bode 155 Fadel Crescent Apt. 144 Hyattburgh Maine 46021 45000 120000 10000
4 121213 Bauch-Goldner 7274 Marissa Common Shanahanchester California 49681 162000 120000 35000
We want to add a total column to show total sales for Jan, Feb and Mar. This is straightforward in Excel and in pandas.
For Excel, I have added the formula sum(G2:I2) in column J. Here is what it looks like in Excel: Next, here is how we do it in pandas: df[""total""]=df[""Jan""]+df[""Feb""]+df[""Mar""]df.head() account name street city state postal-code Jan Feb Mar total 0 211829 Kerluke, Koepp and Hilpert 34456 Sean Highway New Jaycob Texas 28752 10000 62000 35000 107000 1 320563 Walter-Trantow 1311 Alvis Tunnel Port Khadijah NorthCarolina 38365 95000 45000 35000 175000 2 648336 Bashirian, Kunde and Price 62184 Schamberger Underpass Apt. 231 New Lilianland Iowa 76517 91000 120000 35000 246000 3 109996 D’Amore, Gleichner and Bode 155 Fadel Crescent Apt. 144 Hyattburgh Maine 46021 45000 120000 10000 175000 4 121213 Bauch-Goldner 7274 Marissa Common Shanahanchester California 49681 162000 120000 35000 317000Next, let’s get some totals and other values for each month. Here is what we are trying to do as shown in Excel: As you can see, we added a SUM(G2:G16) in row 17 in each of the columns to get totals by month. Performing column level analysis is easy in pandas. Here are a couple of examples. df[""Jan""].sum(),df[""Jan""].mean(),df[""Jan""].min(),df[""Jan""].max() (1462000, 97466.666666666672, 10000, 162000) Now, we want to add a total by month and grand total. This is where pandas and Excel diverge a little. It is very simple to add totals in cells in Excel for each month. Because pandas need to maintain the integrity of the entire DataFrame, there are a couple more steps. First, create a sum for the month and total columns. sum_row=df[[""Jan"",""Feb"",""Mar"",""total""]].sum()sum_row Jan 1462000 Feb 1507000 Mar 717000 total 3686000 dtype: int64 This is fairly intuitive however, if you want to add totals as a row, you need to do some minor manipulations. We need to transpose the data and convert the Series to a DataFrame so that it is easier to concat onto our existing data. The T function allows us to switch the data from being row-based to column-based. df_sum=pd.DataFrame(data=sum_row).Tdf_sum Jan Feb Mar total 0 1462000 1507000 717000 3686000The final thing we need to do before adding the totals back is to add the missing columns. We use reindex to do this for us. The trick is to add all of our columns and then allow pandas to fill in the values that are missing. df_sum=df_sum.reindex(columns=df.columns)df_sum account name street city state postal-code Jan Feb Mar total 0 NaN NaN NaN NaN NaN NaN 1462000 1507000 717000 3686000Now that we have a nicely formatted DataFrame, we can add it to our existing one using append . df_final=df.append(df_sum,ignore_index=True)df_final.tail() account name street city state postal-code Jan Feb Mar total 11 231907 Hahn-Moore 18115 Olivine Throughway Norbertomouth NorthDakota 31415 150000 10000 162000 322000 12 242368 Frami, Anderson and Donnelly 182 Bertie Road East Davian Iowa 72686 162000 120000 35000 317000 13 268755 Walsh-Haley 2624 Beatty Parkways Goodwinmouth RhodeIsland 31919 55000 120000 35000 210000 14 273274 McDermott PLC 8917 Bergstrom Meadow Kathryneborough Delaware 27933 150000 120000 70000 340000 15 NaN NaN NaN NaN NaN NaN 1462000 1507000 717000 3686000ADDITIONAL DATA TRANSFORMS For another example, let’s try to add a state abbreviation to the data set. From an Excel perspective the easiest way is probably to add a new column, do a vlookup on the state name and fill in the abbreviation. 
I did this and here is a snapshot of what the results looks like: You’ll notice that after performing the vlookup, there are some values that are not coming through correctly. That’s because we misspelled some of the states. Handling this in Excel would be really challenging (on big data sets). Fortunately with pandas we have the full power of the python ecosystem at our disposal. In thinking about how to solve this type of messy data problem, I thought about trying to do some fuzzy text matching to determine the correct value. Fortunately someone else has done a lot of work in this are. The fuzzy wuzzy library has some pretty useful functions for this type of situation. Make sure to get it and install it first. The other piece of code we need is a state name to abbreviation mapping. Instead of trying to type it myself, a little googling found this code . Get started by importing the appropriate fuzzywuzzy functions and define our state map dictionary. fromfuzzywuzzyimportfuzzfromfuzzywuzzyimportprocessstate_to_code={""VERMONT"":""VT"",""GEORGIA"":""GA"",""IOWA"":""IA"",""Armed Forces Pacific"":""AP"",""GUAM"":""GU"",""KANSAS"":""KS"",""FLORIDA"":""FL"",""AMERICAN SAMOA"":""AS"",""NORTH CAROLINA"":""NC"",""HAWAII"":""HI"",""NEW YORK"":""NY"",""CALIFORNIA"":""CA"",""ALABAMA"":""AL"",""IDAHO"":""ID"",""FEDERATED STATES OF MICRONESIA"":""FM"",""Armed Forces Americas"":""AA"",""DELAWARE"":""DE"",""ALASKA"":""AK"",""ILLINOIS"":""IL"",""Armed Forces Africa"":""AE"",""SOUTH DAKOTA"":""SD"",""CONNECTICUT"":""CT"",""MONTANA"":""MT"",""MASSACHUSETTS"":""MA"",""PUERTO RICO"":""PR"",""Armed Forces Canada"":""AE"",""NEW HAMPSHIRE"":""NH"",""MARYLAND"":""MD"",""NEW MEXICO"":""NM"",""MISSISSIPPI"":""MS"",""TENNESSEE"":""TN"",""PALAU"":""PW"",""COLORADO"":""CO"",""Armed Forces Middle East"":""AE"",""NEW JERSEY"":""NJ"",""UTAH"":""UT"",""MICHIGAN"":""MI"",""WEST VIRGINIA"":""WV"",""WASHINGTON"":""WA"",""MINNESOTA"":""MN"",""OREGON"":""OR"",""VIRGINIA"":""VA"",""VIRGIN ISLANDS"":""VI"",""MARSHALL ISLANDS"":""MH"",""WYOMING"":""WY"",""OHIO"":""OH"",""SOUTH CAROLINA"":""SC"",""INDIANA"":""IN"",""NEVADA"":""NV"",""LOUISIANA"":""LA"",""NORTHERN MARIANA ISLANDS"":""MP"",""NEBRASKA"":""NE"",""ARIZONA"":""AZ"",""WISCONSIN"":""WI"",""NORTH DAKOTA"":""ND"",""Armed Forces Europe"":""AE"",""PENNSYLVANIA"":""PA"",""OKLAHOMA"":""OK"",""KENTUCKY"":""KY"",""RHODE ISLAND"":""RI"",""DISTRICT OF COLUMBIA"":""DC"",""ARKANSAS"":""AR"",""MISSOURI"":""MO"",""TEXAS"":""TX"",""MAINE"":""ME""} Here are some example of how the fuzzy text matching function works. process.extractOne(""Minnesotta"",choices=state_to_code.keys()) ('MINNESOTA', 95) process.extractOne(""AlaBAMMazzz"",choices=state_to_code.keys(),score_cutoff=80) Now that we know how this works, we create our function to take the state column and convert it to a valid abbreviation. We use the 80 score_cutoff for this data. You can play with it to see what number works for your data. You’ll notice that we either return a valid abbreviation or an np.nan so that we have some valid values in the field. 
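The function defined next lost its line breaks when this page was captured, so here it is restated with its whitespace (and the imports it relies on) restored; it assumes the state_to_code dictionary defined above.
from fuzzywuzzy import process
import numpy as np

def convert_state(row):
    # find the closest state name; only accept matches scoring 80 or better
    abbrev = process.extractOne(row['state'],
                                choices=state_to_code.keys(),
                                score_cutoff=80)
    if abbrev:
        return state_to_code[abbrev[0]]
    return np.nan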
defconvert_state(row):abbrev=process.extractOne(row[""state""],choices=state_to_code.keys(),score_cutoff=80)ifabbrev:returnstate_to_code[abbrev[0]]returnnp.nan Add the column in the location we want and fill it with NaN values df_final.insert(6,""abbrev"",np.nan)df_final.head() account name street city state postal-code abbrev Jan Feb Mar total 0 211829 Kerluke, Koepp and Hilpert 34456 Sean Highway New Jaycob Texas 28752 NaN 10000 62000 35000 107000 1 320563 Walter-Trantow 1311 Alvis Tunnel Port Khadijah NorthCarolina 38365 NaN 95000 45000 35000 175000 2 648336 Bashirian, Kunde and Price 62184 Schamberger Underpass Apt. 231 New Lilianland Iowa 76517 NaN 91000 120000 35000 246000 3 109996 D’Amore, Gleichner and Bode 155 Fadel Crescent Apt. 144 Hyattburgh Maine 46021 NaN 45000 120000 10000 175000 4 121213 Bauch-Goldner 7274 Marissa Common Shanahanchester California 49681 NaN 162000 120000 35000 317000We use apply to add the abbreviations into the approriate column. df_final['abbrev']=df_final.apply(convert_state,axis=1)df_final.tail() account name street city state postal-code abbrev Jan Feb Mar total 11 231907 Hahn-Moore 18115 Olivine Throughway Norbertomouth NorthDakota 31415 ND 150000 10000 162000 322000 12 242368 Frami, Anderson and Donnelly 182 Bertie Road East Davian Iowa 72686 IA 162000 120000 35000 317000 13 268755 Walsh-Haley 2624 Beatty Parkways Goodwinmouth RhodeIsland 31919 RI 55000 120000 35000 210000 14 273274 McDermott PLC 8917 Bergstrom Meadow Kathryneborough Delaware 27933 DE 150000 120000 70000 340000 15 NaN NaN NaN NaN NaN NaN NaN 1462000 1507000 717000 3686000I think this is pretty cool. We have developed a very simple process to intelligently clean up this data. Obviously when you only have 15 or so rows, this is not a big deal. However, what if you had 15,000? You would have to do something manual in Excel to clean this up. SUBTOTALS For the final section of this article, let’s get some subtotals by state. In Excel, we would use the subtotal tool to do this for us. The output would look like this: Creating a subtotal in pandas, is accomplished using groupby df_sub=df_final[[""abbrev"",""Jan"",""Feb"",""Mar"",""total""]].groupby('abbrev').sum()df_sub Jan Feb Mar total abbrev AR 150000 120000 35000 305000 CA 162000 120000 35000 317000 DE 150000 120000 70000 340000 IA 253000 240000 70000 563000 ID 70000 120000 35000 225000 ME 45000 120000 10000 175000 MS 62000 120000 70000 252000 NC 95000 45000 35000 175000 ND 150000 10000 162000 322000 PA 70000 95000 35000 200000 RI 200000 215000 70000 485000 TN 45000 120000 55000 220000 TX 10000 62000 35000 107000Next, we want to format the data as currency by using applymap to all the values in the data frame. defmoney(x):return""${:,.0f}"".format(x)formatted_df=df_sub.applymap(money)formatted_df Jan Feb Mar total abbrev AR $150,000 $120,000 $35,000 $305,000 CA $162,000 $120,000 $35,000 $317,000 DE $150,000 $120,000 $70,000 $340,000 IA $253,000 $240,000 $70,000 $563,000 ID $70,000 $120,000 $35,000 $225,000 ME $45,000 $120,000 $10,000 $175,000 MS $62,000 $120,000 $70,000 $252,000 NC $95,000 $45,000 $35,000 $175,000 ND $150,000 $10,000 $162,000 $322,000 PA $70,000 $95,000 $35,000 $200,000 RI $200,000 $215,000 $70,000 $485,000 TN $45,000 $120,000 $55,000 $220,000 TX $10,000 $62,000 $35,000 $107,000The formatting looks good, now we can get the totals like we did earlier. sum_row=df_sub[[""Jan"",""Feb"",""Mar"",""total""]].sum()sum_row Jan 1462000 Feb 1507000 Mar 717000 total 3686000 dtype: int64 Convert the values to columns and format it. 
df_sub_sum=pd.DataFrame(data=sum_row).Tdf_sub_sum=df_sub_sum.applymap(money)df_sub_sum Jan Feb Mar total 0 $1,462,000 $1,507,000 $717,000 $3,686,000Finally, add the total value to the DataFrame. final_table=formatted_df.append(df_sub_sum)final_table Jan Feb Mar total AR $150,000 $120,000 $35,000 $305,000 CA $162,000 $120,000 $35,000 $317,000 DE $150,000 $120,000 $70,000 $340,000 IA $253,000 $240,000 $70,000 $563,000 ID $70,000 $120,000 $35,000 $225,000 ME $45,000 $120,000 $10,000 $175,000 MS $62,000 $120,000 $70,000 $252,000 NC $95,000 $45,000 $35,000 $175,000 ND $150,000 $10,000 $162,000 $322,000 PA $70,000 $95,000 $35,000 $200,000 RI $200,000 $215,000 $70,000 $485,000 TN $45,000 $120,000 $55,000 $220,000 TX $10,000 $62,000 $35,000 $107,000 0 $1,462,000 $1,507,000 $717,000 $3,686,000You’ll notice that the index is ‘0’ for the total line. We want to change that using rename . final_table=final_table.rename(index={0:""Total""})final_table Jan Feb Mar total AR $150,000 $120,000 $35,000 $305,000 CA $162,000 $120,000 $35,000 $317,000 DE $150,000 $120,000 $70,000 $340,000 IA $253,000 $240,000 $70,000 $563,000 ID $70,000 $120,000 $35,000 $225,000 ME $45,000 $120,000 $10,000 $175,000 MS $62,000 $120,000 $70,000 $252,000 NC $95,000 $45,000 $35,000 $175,000 ND $150,000 $10,000 $162,000 $322,000 PA $70,000 $95,000 $35,000 $200,000 RI $200,000 $215,000 $70,000 $485,000 TN $45,000 $120,000 $55,000 $220,000 TX $10,000 $62,000 $35,000 $107,000 Total $1,462,000 $1,507,000 $717,000 $3,686,000CONCLUSION By now, most people know that pandas can do a lot of complex manipulations on data - similar to Excel. As I have been learning about pandas, I still find myself trying to remember how to do things that I know how to do in Excel but not in pandas. I realize that this comparison may not be exactly fair - they are different tools. However, I hope to reach people that know Excel and want to learn what alternatives are out there for their data processing needs. I hope these examples will help others feel confident that they can replace a lot of their crufty Excel data manipulations with pandas. I found this exercise helpful to cement these ideas in my mind. I hope it works for you as well. If you have other Excel tasks that you would like to learn how to do in pandas, let me know via the comments below and I will try to help. 
",Common excel tasks in pandas part,Common Excel Tasks Demonstrated in Pandas,Live,189 528,"HORIZONTAL SCALING ARRIVES ON COMPOSE ENTERPRISE Published Apr 25, 2017. compose scaling mongodb
Today, Compose is bringing horizontal scaling to more databases on our Enterprise platform. MongoDB, Elasticsearch and ScyllaDB deployments join Compose's Redis as databases with horizontal scaling options on Compose Enterprise. For MongoDB, that means that MongoDB users will be able to add shards to their MongoDB deployments to spread their database load across systems. Collections can be split across shards and each can handle queries on its local data independent of other shards. For Elasticsearch and ScyllaDB users, they will have the option to add database nodes to their cluster and replicate their data across more hosts. By doing this, they increase redundancy in their configuration and allow more nodes to handle read loads (Elasticsearch) and read/write loads (ScyllaDB).
We're making this flexibility available to Compose Enterprise customers who need to do this particular form of scaling. Most users of Compose won't need horizontal scaling and can continue to use Compose's powerful vertical auto-scaling system, which adds resources to your database deployment precisely when they are needed.
DELIVERING HORIZONTAL SCALING
When we developed the horizontal scaling options for Compose we found, from working with customers, that there were many variables to take account of. So many that we also decided that we would make this what we are calling a guided feature. The horizontal scaling technology is built into the Compose platform, but we are only activating it for Compose Enterprise customers who have consulted with support on how well their needs fit with horizontally scaling their deployments. We'll guide them through the factors that will affect their deployment. Once support has confirmed a good fit, we will activate the feature for them to use as they want.
BEYOND ENTERPRISE
We are constantly refining the Compose platform and the user experience and will be revisiting how we deliver horizontal scaling regularly. For the immediate future, it's an exclusive feature available to Compose Enterprise users.
If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com. We're happy to hear from you.
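As a closing aside for readers unfamiliar with what splitting a collection across shards looks like at the database level, here is a rough sketch using standard MongoDB administration commands via pymongo. This is generic MongoDB usage, not a Compose-specific procedure, and the connection string, database, collection and shard key names are all invented.
from pymongo import MongoClient

# illustrative connection string; point this at your cluster's mongos router
client = MongoClient('mongodb://localhost:27017')

# enable sharding for a database, then shard one of its collections
client.admin.command('enableSharding', 'telemetry')
client.admin.command('shardCollection', 'telemetry.readings',
                     key={'device_id': 'hashed'})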
","Today, Compose is bringing horizontal scaling to more databases on our Enterprise platform.",Horizontal Scaling arrives on Compose Enterprise,Live,190 533,"THIS WEEK IN DATA SCIENCE (MARCH 28, 2017) Posted on March 28, 2017 by Janice Darling
Here's this week's news in Data Science and Big Data. Don't forget to subscribe if you find this useful!
INTERESTING DATA SCIENCE ARTICLES AND NEWS
* Data Analytics for Societal Good – An account of an instance of Data Analytics for Societal Good.
* What Is Data Science, and What Does a Data Scientist Do? – A simple definition of the roles, experiences, qualifications etc. of the term Data Scientist.
* In Defense of Simplicity, A Data Visualization Journey – Discussing the field of Data Visualization.
* How does machine learning work? – Extract from the IBM booklet “How it works – Machine Learning”.
* Getting Started with Deep Learning – Different approaches to getting started with deep learning from a framework perspective.
* Interview questions for data scientists – Advice for recruiters and candidates for data science job interviews.
* IBM launches blockchain as a service for the enterprise – How IBM is enabling developers to quickly build and host secure blockchain networks via the IBM Cloud.
* Data Science vs. Data Analytics – Why Does It Matter? – Discussion of the difference between the terms Data Science and Data Analytics.
* Sentiment Analysis of Warren Buffett's Letters to Shareholders – Code and visualization of the results of Sentiment Analysis on Warren Buffett's Letters to Shareholders.
* Galvanize will teach students how to use IBM Watson APIs with new machine learning course – IBM will partner with Galvanize to familiarize students with IBM's suite of Watson APIs.
* How Data Science Can Help You Not to be Blindsided in Decision-Making – How Data Science can affect every business function.
* The Future of Machine Learning in Finance – Discussion of the future of Machine Learning in Finance.
* The Best Resources for Learning D3.js – A list of resources to learn the Javascript library d3.js.
* Understanding the power of real-time geospatial analytics – How the Geospatial Analytics service in IBM Bluemix can monitor moving devices from the Internet of Things.
* The Top 12 Tips for Data Visualization – Tips for creating simple yet effective data visualization.
* IBM Watson Health can now detect head trauma – IBM Watson Health partners with MedyMatch Technology, an Israel-based startup that uses advanced cognitive analytics and artificial intelligence to deliver medical solutions.
UPCOMING DATA SCIENCE EVENTS
* Introduction to Python with Data Analysis (Hands-On) – March 30, 2017 @ 6:00 pm – 9:00 pm
FEATURED COURSES FROM BDU
* Big Data 101 – What Is Big Data? Take Our Free Big Data Course to Find Out.
* Predictive Modeling Fundamentals I – Take this free course and learn the different mathematical algorithms used to detect patterns hidden in data.
* Using R with Databases – Learn how to unleash the power of R when working with relational databases in our newest free course.
* Deep Learning with TensorFlow – Take this free TensorFlow course and learn how to use Google's library to apply deep learning to different data types in order to solve real world problems.
COOL DATA SCIENCE VIDEOS
* Machine Learning With Python – Supervised Learning K Nearest Neighbors – An introduction to the K Nearest Neighbors Algorithm.
* Machine Learning With Python – Supervised Learning Decision Trees – An overview of Decision Trees.
* Machine Learning With Python – Supervised Learning Random Forests – A brief discussion of Random Forests and their applications.
",Here's this week's news in Data Science and Big Data.,"This Week in Data Science (March 28, 2017)",Live,191 535,"SENSOR SENSIBILITY AT HULL DIGITAL Glynn Bird / February 12, 2016
C4Di (Centre for Digital Innovation) is a newly opened Digital Hub based in Kingston-upon-Hull. It hosts office space for established businesses and startups together with hot desks for individuals and companies. There are game developers, drone builders, music distributors, marketing agencies, and Kickstarter projects all hosted in a building with a 60Gbps, symmetric internet connection; that's 1Gbps per desk! I was invited to speak at the latest Hull Digital Meetup, which hosts gatherings for developers and entrepreneurs at regular intervals. Tonight's talk was entitled IoT – Sensor Sensibility, a title that I'm more proud of than I should be.
It described the buzzword that is the Internet of Things, the insane amount of investment that is pouring into IoT-related startups, and how the technology works: how MQTT is used to transmit data from sensors to the cloud and how the same protocol can be used to close the feedback loop. The talk also touched on using an Offline-First approach, storing data on the local device and syncing to the cloud later using Apache CouchDB & IBM Cloudant. Finally, the talk ran through some of the hardware that's available, from Raspberry Pis to SensorTags. Thanks to Jon Moss for hosting me at the fabulous C4Di headquarters. Here are the slides from the talk: IoT Sensor Sensibility – Hull Digital – C4Di – Feb 2016 from Glynn Bird.","My IoT talk in Hull included an Offline-First approach. Store data on a local device and sync to the cloud later using Apache CouchDB and IBM Cloudant.",Sensor Sensibility at Hull Digital,Live,192 536,"Greg Filla, Product manager & Data scientist — Data Science Experience and Watson Machine Learning. May 22
SPARK 2.1 AND JOB MONITORING AVAILABLE IN DSX
Today we are announcing support for Apache® Spark™ 2.1 and enhanced Spark job monitoring in the IBM Data Science Experience.
SPARK 2.1
The latest official release of Spark comes with plenty of new features, such as expanded structured streaming support (welcome Kafka 0.10 :-)), new algorithms available in SparkR, plus 1200 bug fixes to help make everything run smoothly. If you are interested in seeing the full list of changes from Spark 2.0 to 2.1, check out the Spark 2.1 Release Announcement.
SPARK JOB MONITORING
Do you ever kick off a Spark job in a Jupyter notebook and wonder if it's making any progress? Today, we are announcing that Python and Scala DSX notebooks now generate progress bars for Spark jobs. Let's see it in action: In addition to showing or hiding the progress bars at the cell level, you also have the option to hide all progress bar output as shown in the following example: Another way you can track activity on your Spark cluster is by using the Spark History Server. This can be accessed inside DSX notebooks by navigating to the environment tab. You can try out these features and more by creating a free account for Data Science Experience.
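For readers who want something concrete to run, a notebook cell like the following is enough to kick off a small Spark job and watch the progress bar appear. This is a hedged sketch: it assumes the notebook's preconfigured SparkContext is available as sc, and the numbers are arbitrary.
# a trivial Spark job: the progress bar is shown while its stages run
rdd = sc.parallelize(range(1000000), 8)
total = rdd.map(lambda x: x * x).sum()
print(total)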
-------------------------------------------------------------------------------- Originally published at datascience.ibm.com on May 22, 2017.",Today we are announcing support for Apache® Spark™ 2.1 and enhanced Spark job monitoring in the IBM Data Science Experience. The latest official release of Spark comes with plenty of new features…,Spark 2.1 and Job Monitoring Available in DSX,Live,193 538,"PODCASTS DATA SCIENCE FOR REAL-TIME STREAMING ANALYTICS April 18, 2017 | 11:38 OVERVIEW Listen to this podcast where Roger Rea, Senior Offering Manager for IBM Streams, shares his thoughts on how data scientists can create real-time applications using IBM Streams. Find more information about IBM Streams Topics: Analytics, Big Data Use Cases, Data Scientists Tags: streaming analytics, real-time analytics, Streams, data science
",Listen to this podcast where Roger Rea, Senior Offering Manager for IBM Streams, shares his thoughts on how data scientists can create real-time applications using IBM Streams.,Data science for real-time streaming analytics,Live,194 544,"mark simmonds, Program Director, IBM Analytics Development, Snowboarder and Archer. Jul 24 -------------------------------------------------------------------------------- ARTIFICIAL INTELLIGENCE, ETHICALLY SPEAKING (Image: Wikimedia Commons license) “I'm Sorry Dave — I'm afraid I can't do that.” Many readers will recognize that line from Stanley Kubrick's “2001 — A Space Odyssey”, a film in which the onboard computer, a HAL 9000, perceives an astronaut to be a threat to its “existence” and refuses to open the airlock to allow the crew member back into the ship. Other films like “Ex-Machina”, “i-Robot”, and “Terminator” sow similar fears of Artificial Intelligence systems with cognitive capabilities taking control from humans, rendering us defenseless.
Of course, there are also films that focus of on the positive aspects of AI such as “Bicentennial Man”. My view is that AI systems are increasingly necessary to augment what we do in our everyday lives — whether that means… … turning devices on or off, intelligently learning when and where to do so, … repeating mundane tasks, … giving us additional insights into human existence, or … guiding us toward better decisions … and beyond. So, why all the fear? Partly because there is so much misinformation and hype — and some people just like to sell fear, uncertainly and doubt (F.U.D.). And it’s true that there will always be people who seek to exploit technology to do bad things — the dark side v the light side (Star Wars fans). Nonetheless, hype is a valuable part of the technology lifecycle. It allows us to consider use cases (sometimes extreme) that were not initially considered relevant. What’s clear is that machine learning in all its forms is here to stay. It has established its place in the world and particularly in business — from detecting and identifying trends and patterns faster and more often than humans alone could ever achieve (while learning and become progressively smarter as they go) — to helping predict outcomes and taking action to prevent fraud — to slashing the time it takes to design advanced cancer treatment programs and health programs (see figure 1) — to anticipating terror attacks — to recognizing business opportunities that might last only a moment — to ridding processes of personal bias and prejudice. I believe that machine learning and AI systems have the potential to make our world a safer and better place. Figure 1 : Healthcare embraces AI and machine learningEven so, the ethical side of machine learning is increasingly called into question. The potential of machine learning and its application to all things AI means we need rules and controls — not to prevent progress but to help manage and control how and when progress occurs. Let’s walk through some scenarios. MACHINE LEARNING APP ENVY What if machine learning algorithms are pitched against each other to win a battle, say a game of chess or other simulation? Not a big deal. An outcome could be defined as not losing to an opponent — establishing a win or at least a draw. But what if these machine learning systems were used in a war situation against each other? Human life, entire civilizations, and life itself are the stakes. Winning is this case could be defined as seeking an acceptable outcome while minimizing losses. That’s why we humans need to be careful to avoid delegating 100% authority to an AI system in such situations. ENDING LIFE VS. SAVING LIFE There’s general agreement that humans should have the final say where human life is concerned, but does that allow us to play “God” if an AI system demonstrates it can preserve life even though a human may believe it is best to end a life? While most humans would seek to preserve life, greed, personal bias, hate, jealousy can often be powerful dark forces that can be used to serve judgement. It is important that any decision involving an AI system must have an audit trail clearly showing a path to the outcome. After all, AI systems can learn from these outcomes also. NON-HUMAN LIFE FORMS Moving beyond humans, nothing stops us from applying AI to animal behavior. We have performed enough animal psychology over the years to think we understand animals. Would an AI system be better at training an animal? 
Would it be ethical in man’s perceived superiority and domination over all other species to subject those species to AI? Again, we must consider under what circumstances AI can be used to make decisions over other life forms. CONSCIENCE AND COMPASSION Today, limited by what we know of life, physics and computing, AI systems are just computer models and simulations of human behavior. Could a network of AI systems have a conscience — even though it may be simulated? My personal feelings are unique to my life experiences so what makes me happy or reduces me to tears is different from other humans. Emotions are chemical reactions. AI systems are not. But what if AI systems could apply cognitive actions and outcomes to a bank of human chemicals in a controlled environment to learn about emotion? It is conceivable that an AI system could therefore develop a conscience and even compassion. BIG RELIGION Big Religion is a phrase I hear more and more as Big Data became an established term. It means looking at scriptures and religions with other sources of data, events and the tools of science. It scares a lot of people for the challenge it might pose to their belief systems. Some may fear that is also challenges the power and control associated with some religious establishments. Nonetheless this is happening today and can’t be stopped. Humans inevitably seek to more deeply understand the universe and world around us, challenging ourselves about what we perceive as the truth beyond our faith. SPANNING CULTURAL DIVIDES AND VALUE SYSTEMS Diversity makes the world a fascinating place. It’s one of the reasons many of us decide to vacation in different parts of the world to experience other cultures, food, traditions, languages. In doing so we learn more about history, different belief and value systems. I wonder whether AI systems built within different cultures with different value systems will behave differently with those of other cultures. Consider a global AI system that encompasses/embraces all of this diversity and difference. What might be the global impact on world leaders? WE ARE ONE With recent advances is nanotechnology, it’s possible for nanobots to enter our bodies — even our bloodstreams — to attack viruses and potentially repair damaged bones and tissue. There may be a time where we can use nanotechnology to fight obesity or vainly to enhance our looks, our physical performance. If these nanobots exist forever in our bodies, do we become part human and part something else? There is a lot of research happening in the area. RESPONSIBILITY V ACCOUNTABILITY There are some things that humans must have the final say on. Checks and balances. How far are we prepared to go in delegating responsibility to the AI system? How well are the policies designed? Have some policies been designed or even adapted over time by machine learning? While machines could be responsible for sustaining or ending life, can they be held accountable — and if so, what are the legal implications? Today humans carry the burden of both responsibility and accountability. This aligns with our legal systems, but we can’t put an AI system on trial — we just don’t have the legal capacity to do that today. AI systems learn from human interactions and both the data we produce and the data it produces. Would that imply that many people would potentially be on trial should a legal case emerge involving AI systems? 
Could the AI system or its creators assert that the human legal system has no jurisdiction over it, or that the legal system even infringes the rights of the AI system? Ethics in this area are just not mature enough today to give us clear answers. But it's only a matter of time before we encounter such situations. Finally, we could ask whether it's ethical for AI systems to design and implement their own set of ethics. I guess my answer would be yes — provided humans remain involved and can override any final outcomes where decisions involving human life and welfare are concerned. SUMMARY AI systems already augment what we do and the decisions we make today. The human species will push the boundaries of machine learning, cognitive computing and AI systems beyond our current perceptions of their application, through positive and negative exploitation that will ultimately result in AI systems capable of achieving outcomes beyond our imaginations. The ethics will only emerge as cases arise that test our legal systems, our value systems and even our belief systems. Despite some of the F.U.D. we read, I believe that machine learning can help our world become a smarter, safer and better place for us and future generations — future generations of people and AI systems. For more information on AI, cognitive computing and IBM research click here.","My view is that AI systems are increasingly necessary to augment what we do in our everyday lives — whether that means… So, why all the fear?","Artificial Intelligence, Ethically Speaking – Inside Machine learning – Medium",Live,195 546,"CREATING AN AWS VPC AND SECURED COMPOSE MONGODB WITH TERRAFORM Published Mar 2, 2017. Connecting to Compose MongoDB from Amazon VPC? Using Terraform for orchestration? In this Write Stuff article, Yamil Asusta shows us how to create secure connections to Compose MongoDB using Terraform and Amazon VPC. Security is often overlooked when busy shipping products. As a result of that, thousands of databases are being held captive from their operators. The attack was possible because none of the security alternatives were implemented for their deployments. Luckily for us developers, Compose provides us with deployments that include security defaults which can be further expanded to reduce risk. In this post, I hope to explain some basic security practices to lock down access to a MongoDB deployment from VPC. AWS VPC Assuming we are starting from scratch, we need to spin up some infrastructure in which we can launch our servers. To do so, we will use one of my favorite tools, Terraform. Create a main.tf file and add the following: provider ""aws"" { region = ""us-east-1"" # feel free to adjust } This tells Terraform our target region for the next operations. CREATING A VPC Let's proceed with creating a VPC. For the purposes of this post, we will only launch 1 public subnet and 1 private subnet using Segment.io's Stack.
Add the following to the file: module ""vpc"" { source = ""github.com/segmentio/stack//vpc"" name = ""my-test-vpc"" environment = ""staging"" cidr = ""10.30.0.0/16"" internal_subnets = [""10.30.0.0/24""] external_subnets = [""10.30.100.0/24""] availability_zones = [""us-east-1a""] # ensure it matches the one for your provider } Note: Do not go to production with this setup since it will leave you prone to downtime in the scenario where the Availability Zone collapses. This ""vpc"" module will launch an Internet Gateway and attach it to the VPC, thus allowing instances launched in the public subnet to reach the internet (assuming the were assigned a public IP). Additionally, it launches the most important piece, a NAT server. The NAT is launched in a public subnet and is linked to a private subnet which in result, gives instances in the subnet access to the internet. The NAT is provisioned with an Elastic IP and all requests coming from the private subnet will have this IP (see where I'm going with this?). MAKING THE PRIVATE SUBNET AVAILABLE Now we have reachable subnet and one that isn't. How do we fix that? Let's create a bastion which will let us jump from our public subnet to our private ones. Add this to the file: module ""bastion"" { source = ""github.com/segmentio/stack//bastion"" region = ""us-east-1"" # make sure it matches the one for the provider environment = ""staging"" key_name = ""my awesome key"" # upload this in the AWS console vpc_id = ""${module.vpc.id}"" subnet_id = ""${module.vpc.external_subnets[0]}"" security_groups = ""${aws_security_group.bastion.id}"" } resource ""aws_security_group"" ""bastion"" { name = ""bastion"" description = ""Allow SSH traffic to bastion"" vpc_id = ""${module.vpc.id}"" ingress { from_port = 22 to_port = 22 protocol = ""tcp"" cidr_blocks = [""0.0.0.0/0""] } egress { from_port = 0 to_port = 0 protocol = ""-1"" cidr_blocks = [""0.0.0.0/0""] } lifecycle { create_before_destroy = true } } The security group of the bastion only allows SSH for inbound. We could further tighten it up but we are going to keep it simple for the sake of example. Let's launch an instance in the private subnet using the following: resource ""aws_instance"" ""instance"" { ami = ""ami-0b33d91d"" # Amazon Linux AMI key_name = ""my awesome key"" instance_type = ""t2.nano"" subnet_id = ""${module.vpc.internal_subnets[0]}"" vpc_security_group_ids = [""${aws_security_group.instance.id}""] associate_public_ip_address = false tags { Name = ""ComposeIPWhitelisted"" } } resource ""aws_security_group"" ""instance"" { name = ""instance"" description = ""Allow SSH traffic from bastion"" vpc_id = ""${module.vpc.id}"" ingress { from_port = 22 to_port = 22 protocol = ""tcp"" security_groups = [""${aws_security_group.bastion.id}""] # only the bastion SG can access me :) } egress { from_port = 0 to_port = 0 protocol = ""-1"" cidr_blocks = [""0.0.0.0/0""] } lifecycle { create_before_destroy = true } } Notice that the security group for the instance only allows traffic from the bastion's security group. Once we have this ready, let's add some outputs so we can get going. output ""bastion-ip"" { value = ""${module.bastion.external_ip}"" } output ""nat-ips"" { value = ""${module.vpc.internal_nat_ips}"" } output ""instance-ip"" { value = ""${aws_instance.instance.private_ip}"" } At this point, your main.tf must look similar to this one . 
Terraform time: $ terraform get # pulls dependencies $ terraform plan # this will show you the things to be created/destroyed in the next step $ terraform apply # applies the plan, effectively creating our infrastructure Once the apply is complete, we can SSH into our bastion using the resulting IP by running: $ ssh -A ubuntu@bastionIP # assuming we selected the same key pair, -A will forward our keys, allowing us to jump with them Within the bastion, SSH into our private instance by running: $ ssh ec2-user@instanceIP # ec2-user is the default user of the Amazon Linux AMI CONFIGURING MONGODB Go ahead and provision a MongoDB deployment from the Compose dashboard. Be sure to select Enable SSL access. By enabling this, Compose will provide us with SSL certificates, which will allow us to encrypt our data in transit. This prevents man-in-the-middle attacks. When the deployment is ready, we will be able to access the deployment dashboard. From here we need to do two things: 1. Create a user that we can later use to authenticate against the database. To do so, click on the Browser tab, select the admin database and click Add User. Make sure to remember the password as it will not be available from this point forward. 2. Obtain the SSL certificate we will use to connect to our database. In the Overview tab, there will be a section called ""SSL Certificate (Self-Signed)"". Its contents are hidden and you will be prompted for your password in order to make them visible. This will be available at all times for your convenience. Let's tie everything up now! Within our target host, install the MongoDB shell. If you kept the same AMI (Amazon Linux AMI) you can follow this guide. Additionally, create a file called cert.pem whose contents are the SSL certificate found in the dashboard. You should be able to connect to your MongoDB using this command now: $ mongo --ssl --sslCAFile cert.pem /admin -u -p The data we transmit will be encrypted when we use our certificate. Only one problem left: our MongoDB is still open for anyone to try to authenticate against. Let's fix it by using the IP Whitelist feature. Back in the dashboard, visit the Security tab. Under the section Whitelist TCP/HTTP IPs, select Add IP. When prompted, add the IP address value of the nat-ips output from Terraform. Once the feature is active, all connections that are not from Compose or our designated list will be dropped. Let's make a quick test! Try connecting to MongoDB one more time from our instance. It should work as intended. Now try accessing it from your local network and tell me how it goes ;)
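A footnote that is not part of the original walkthrough: if the service on the private instance talks to MongoDB from Python rather than the shell, a roughly equivalent connection with pymongo would look like the sketch below. The host, port and credentials are placeholders, and on older pymongo releases the options are spelled ssl and ssl_ca_certs rather than tls and tlsCAFile.

# Sketch only: connect to the Compose MongoDB deployment over TLS from Python.
# <host>, <port>, <user> and <password> are placeholders for your own deployment.
from pymongo import MongoClient

client = MongoClient(
    'mongodb://<user>:<password>@<host>:<port>/admin',
    tls=True,               # encrypt data in transit
    tlsCAFile='cert.pem'    # the self-signed certificate saved earlier
)

# This only succeeds when the request originates from a whitelisted IP,
# i.e. the NAT's Elastic IP when run from the private instance.
print(client.admin.command('ping'))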
Attribution: Pexels. This article is licensed with CC-BY-NC-SA 4.0 by Compose.",Compose provides us with deployments that include security defaults which can be further expanded to reduce risk. In this post, I hope to explain some basic security practices to lock down access to a MongoDB deployment from VPC.,Creating an AWS VPC and Secured Compose MongoDB with Terraform,Live,196 550,"CLOUDANT QUERY GROWS UP TO HANDLE AD HOC QUERIES By Glynn Bird, June 1, 2015. Cloudant's NoSQL Database-as-a-Service allows you to store JSON documents in the cloud using a simple HTTP API. Cloudant comes equipped with a number of indexes that allow you to query your data in several powerful ways: * Primary Index to retrieve documents by their id, which is the primary key * MapReduce to do secondary key lookups and online analytics * Cloudant Search for full-text, wildcard and faceted search * Cloudant GeoSpatial for complex polygon and 4D spatial queries * Cloudant Query, a declarative query language that incorporates a number of indexing capabilities. Cloudant Query is the best way to get started with querying Cloudant databases; a simple API call is used to define the list of fields to be indexed. Under the hood, Cloudant Query can leverage various indexes to provide a full breadth of querying capabilities. SAMPLE DATA In order to demonstrate the new features we need some sample data. The following database contains 9,000 movie documents in the following format: { ""_id"": ""71562"", ""_rev"": ""1-72726eda3b8b2973ef259dd0c7410a83"", ""title"": ""The Godfather: Part II"", ""year"": 1974, ""rating"": ""R"", ""runtime"": ""200 min"", ""genre"": [ ""Crime"", ""Drama"" ], ""director"": ""Francis Ford Coppola"", ""writer"": [ ""Francis Ford Coppola (screenplay)"", ""Mario Puzo (screenplay)"", ""Mario Puzo (based on the novel \""The Godfather\"")"" ], ""cast"": [ ""Al Pacino"", ""Robert Duvall"", ""Diane Keaton"", ""Robert De Niro"" ], ""poster"": ""http://ia.media-imdb.com/images/M/..._V1_SX300.jpg"", ""imdb"": { ""rating"": 9.1, ""votes"": 656, ""id"": ""tt0071562"" } } To use this data set: * Sign up for a Cloudant account * Replicate the database into your account. Choose Replication → New Replication and complete the form: Source Database: Remote database - https://examples.cloudant.com/query-movies ; Target Database: New local database - ""movies"" CREATING A CLOUDANT QUERY INDEX Once the data has replicated to your Cloudant account, we can instruct Cloudant to create an index from the Cloudant Dashboard by selecting the database and choosing Query → + → New Query Index. The form will be pre-filled with an index definition of: { ""index"": { ""fields"": [ ""foo"" ] }, ""type"": ""json"" } In our case, we are going to overwrite the sample with a text type index that automatically indexes all fields in all documents in the database. Replace the JSON text to have { ""index"": {}, ""type"": ""text"" } as shown in the screenshot below. Simply click ""Create Index"" to instruct Cloudant to index the movie data. The ""text"" index type is new in this iteration of Cloudant Query and by default indexes all the fields in your document.
We can supply the individual fields to be indexed (in the index object), but by supplying an empty object we are asking for everything to be indexed. The same instruction can be issued using the Cloudant API: curl -X POST https://user:pass@account.cloudant.com/movies/_index -d '{ ""index"": {}, ""type"": ""text"" }' -- substituting user, pass and account for your own personal Cloudant credentials. QUERYING A CLOUDANT QUERY INDEX Cloudant Query queries are JSON documents with the following top-level items: * selector - which subset of the data to return; the equivalent of the WHERE part of an SQL statement * fields - the fields to be returned; the equivalent of the SELECT part of an SQL statement * sort - how the result set is to be ordered; the equivalent of the ORDER BY part of an SQL statement * limit - how many results to return. For example, the SQL statement SELECT title, year FROM movies WHERE imdb.rating > 9.0 ORDER BY year ASC LIMIT 10 corresponds to the Cloudant Query { ""fields"": [""title"", ""year""], ""selector"": { ""imdb.rating"": { ""$gt"": 9.0 } }, ""sort"": [ { ""year:number"": ""asc"" } ], ""limit"": 10 }. At its simplest, a query looks like this: { ""selector"": { ""year"": 2012 } } The above query is looking for films where the year field is equal to 2012. Queries can be cut-and-pasted into the Cloudant Dashboard. Clicking ""Run Query"" posts the results in the right-hand panel. The Cloudant Query API can also be used to perform queries by POSTing to a database's _find endpoint: curl -X POST https://user:pass@account.cloudant.com/movies/_find -d '{ ""selector"": { ""year"": 2012 }, ""limit"": 10 }' CLOUDANT QUERY SELECTOR The selector part of the JSON query allows you to specify which subset of the database to return. Selectors can take several forms: one field: ""selector"": { ""year"": 2012 } multiple fields: ""selector"": { ""year"": 2012, ""rating"": ""R"" } condition operators ($gt, $lt, $eq, $ne ... see our docs for the full list):
see our docs for full list ):""selector"": { ""imdb.rating"": { ""$gt"": 9.0 } }free-text match (the $text operator matches any field in your document):""selector"": { ""$text"": ""Al Pacino"" }match arrays (exactly):""selector"": { ""genre"": [ ""Animation"", ""Comedy"" ] }match value is in array:""selector"": { ""genre"": { ""$in"": [""Horror""] } }match any values are in array:""selector"": { ""year"": { ""$in"": [2013,2015] } }match values are not in array""selector"": { ""year"": { ""$nin"": [2013,2015] } }the existence of fields:""selector"": { ""rating"": { ""$exists"": true } }We can combine the $and , $or and $not operators to produce complex queries:""selector"": { ""$and"" : [ { ""year"": { ""$lt"": 1990 } }, { ""imdb.rating"": { ""$gt"": 7.0 } }, { ""$text"": ""Marlon Brando"" } ]}""selector"": { ""$and"" : [ { ""year"": { ""$gt"": 1980 } }, { ""year"": { ""$lt"": 1990 } }, { ""$not"": { ""title"": ""Aliens"" } }, { ""$text"": ""Sigourney Weaver"" } ]}""selector"": { ""$or"" : [ { ""director"": ""George Lucas"" }, { ""director"": ""Steven Spielberg"" } ]}CLOUDANT QUERY FIELDSThe fields element can be used to instruct the Cloudant Query engine to only return asubset of the underlying documents e.g.{ ""selector"": { ""cast"": { ""$in"": [ ""Julia Roberts"" ] } }, ""fields"": [ ""title"", ""year"", ""imdb.rating"" ], ""limit"": 10}returns only partial documents e.g.{ ""title"": ""Flatliners"", ""year"": 1990, ""imdb"": { ""rating"": 6.5 }}CLOUDANT QUERY SORTIf a sort element is supplied, then the results set is sorted according to the suppliedarray e.g.{ ""selector"": { ""cast"" : { ""$in"" : [""Tom Hanks""] } }, ""sort"": [ { ""year:number"": ""desc"" } ] }With indexes where type=""text"", each field must be paired with the type of thatfield (number or string) to instruct Cloudant Query to treat it as a numericalor alphabetic sorting algorithm. Sort orders can be either ascending ( asc ) or descending ( desc ).Multi-dimensional sorts can be achieved by adding to the sort array:{ ""selector"": { ""cast"" : { ""$in"" : [""Tom Hanks""] } }, ""sort"": [ { ""year:number"": ""asc"" }, { ""title:string"": ""asc"" } ] }CLOUDANT QUERY PAGINATIONWhen using Cloudant Query's type=""text"" indexes, pagination is performed by: * page 1 - performing a query to get first page of search results * page 2 - repeating the query but adding the bookmark parameter received in the reply to the first requeste.g.we perform our first query:curl -X POST https://user:pass@account.cloudant.com/movies/_find -d '{ ""selector"": { ""year"": 2012 }, ""limit"": 10}'which gives a reply of:{ ""docs"":[ ... ], ""bookmark"": ""g2wAAAABaANkABxkYmNvcmVAZGIxLm""}To get the second page of results, we repeat the query and add the firstrequest's bookmark into our object:curl -X POST https://user:pass@account.cloudant.com/movies/_find -d '{ ""selector"": { ""year"": 2012 }, ""limit"": 10, ""bookmark"": ""g2wAAAABaANkABxkYmNvcmVAZGIxLm""}'The bookmark concept is the same mechanism used by Cloudant Search and providesa scalable way to paginate through large result sets.WHAT'S THE DIFFERENCE BETWEEN ""JSON"" AND ""TEXT"" INDEXES?Indexes based on type=""json"" become MapReduce-based materialized views under thehood. Their fixed key structure will only allow queries that match the keystructure. i.e., if we create a ""json"" index based on title , firstname and lastname , we can perform queries based on those three fields but not just lastname , for instance. 
Type=""json"" indexes are quicker to build and may be quicker forsingle-field lookups.Indexes based on type=""text"" become Lucene-based indexes under the hood and cananswer arbitrary queries based on any of the indexed fields in any order.Type=""text"" indexes are the easiest way to start with Cloudant Query as theyindex all fields by default allowing ad-hoc querying of a data set.WATCH CLOUDANT QUERY IN ACTIONThis video provides an overview of Cloudant Query.This video shows you how to build and query a Cloudant Query index.REFERENCES * For further information on Cloudant Query text indexes, please refer to our documentation . * The movie database is a subset of data from OMDB API and is published with permission under Creative Commons licence.Please enable JavaScript to view the comments powered by Disqus.SIGN UP FOR UPDATES!RECENT POSTS * Data Privacy and Governance Update * Cloudant Warehousing: New features and improvements * Announcing ISO 27001 Compliance for Cloudant, dashDB and BigInsights! * Understanding Mango View-Based Indexes vs. Search-Based Indexes * Introducing Monitoring Plugins for IBM Cloudant LocalBlog archive Follow @cloudantPRODUCT * Why DBaaS? * Features * Pricing * DBaaS ComparisonDOCS * Getting Started * API Reference * Libraries * GuidesFOR DEVELOPERS * FAQ * Sample AppsRESOURCES * Blog * Case Studies * Data Sheets * Training * Webinars * Whitepapers * Videos * EventsCOMPANY * About Us * Contact UsNEWS * In the Press * Press Releases * Awards * Terms Of Use * | * Privacy * | * ©IBM Corporation 2016","Cloudant Query is the best way to get started with querying Cloudant databases; a simple API call is used to define the list of fields to be indexed. Under the hood, Cloudant Query can leverage various indexes to provide a full breadth of querying capabilities.",Cloudant Query Grows Up to Handle Ad Hoc Queries,Live,197 553,"Skip to main content IBM developerWorks / Developer Centers Sign In | Register Cloud Data Services * Services * How-Tos * Blog * Events * ConnectTHE NEW SIMPLE DATA PIPEMike Broberg / February 24, 2016Today, we’re introducing a refactored and streamlined Simple Data Pipe , our open-source data movement project. While the workflow for piping data haschanged, the new architecture opens up more free options for data movement onto,or off of, the IBM cloud.WHY CHANGE THE PIPE?Services are changing rapidly on IBM’s Bluemix application platform . As these services evolve, we wanted to create a more modular Simple Data Pipethat could better deal with new features and brand new products.If you’re already using the Simple Data Pipe, don’t fear. We can still move datato dashDB , IBM’s cloud data warehouse. I’ll cover the mechanics of analytics workflowslater on. For now, let’s look at The Pipe’s new architecture and our motivationsbehind it.A SIMPLER DATA PIPE ARCHITECTUREIt’s all about getting data. The big problem the Simple Data Pipe solves hasalways been about sourcing data from disparate Web APIs. The Pipe captures thatdata in its native structure, and persists it in a database that’s flexibleenough to adapt to your plans for processing it.The new Simple Data Pipe no longer assumes that you plan to process data for aparticular use (analytics), in a particular place (dashDB). We’ve modularizedthe architecture of The Pipe by separating the step of landing data in Cloudant from the step of moving data to a different, more specialized place. 
Here’s an “annotated” architecturediagram:The new Simple Data Pipe lands data in CloudantInstead of automating the process of moving data from REST sources → Cloudant → dashDB , the new Simple Data Pipe is scoped more narrowly to REST sources → Cloudant and ends the process there. It’s a cleaner, more modular approach that webelieve better handles the rate of innovation in the Bluemix ecosystem and makesthe data pipe more useful to applications beyond analytics use-cases.What the Pipe has lost in push-button, end-to-end data movement, it has gainedin flexibility. Also, it still allows for future implementations that do move data end-to-end, whenever free APIs are available for analytics engineslike IBM’s Apache Spark service , warehouses like dashDB, and other tools.MORE OPTIONS FOR YOUR NEXT MOVEFor users who are focused on analytics use-cases, the new Simple Data Pipe canstill connect to dashDB, although that connection is no longer baked in. It’snow a separate step completed in Cloudant. While this roster will expand, hereis the current set of options for moving data out of Cloudant: * dashDB , via native Cloudant integration with dashDB. Finish movement using Cloudant’s web dashboard . * Apache Spark , via native Cloudant integration with Bluemix’s Spark service. Finish movement by calling the Cloudant connector in a Spark Scala Notebook . * Transporter , the open source ETL pipeline by Compose.io. Finish movement by configuring package info and associated JavaScript code. * DataWorks , enterprise-grade APIs for data shaping & movement. A paid service on Bluemix as of February 2016. Provision DataWorks on Bluemix first, before deploying the new Simple Data Pipe.When compared to the previous version of the Simple Data Pipe — aside from astreamlined architecture — we’ve removed The Pipe’s dependence on DataWorks.Connecting the DataWorks APIs to the data pipe is still an option, but byremoving this dependency, Cloudant can provide more options for data movement.Moving “Piped” data into dashDB via the Cloudant dashboardWHERE TO GET THE NEW PIPEThe same place as always on our developerWorks site . There you’ll find links to our GitHub repos and other instructions. In thecoming weeks we’ll be updating content to reflect the new Simple Data Pipe.We’ll also kick off a new series of tutorials that shows all the ways you canwork with the Data Pipe’s additional targets.Let’s get that data moving, y’all.SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM PrivacyIBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! 
",Introducing a refactored Pipe architecture for cloud data movement. Connect to REST data sources, and land data all in one place, in its native structure.,New Simple Data Pipe: Easier cloud data movement,Live,198 556,"DATALAYER: STORAGE WARS - THE ART GENOME PROJECT Published Nov 21, 2016. As you can see, DataLayer Conf was full of great talks and this next one is no exception. Daniel Doubrovkine, CEO of Artsy.net and 2016 Ruby prize award nominee, took the stage. Daniel presented Artsy.net's Art Genome Project, a classification system and technological framework that powers Artsy. The Art Genome Project maps the characteristics (Artsy.net calls them “genes”) that connect artists, artworks, architecture, and design objects across history. There are currently over 1,000 characteristics in The Art Genome Project, including art-historical movements, subject matter, and formal qualities. This is the story of the evolution of the data layer and nearest neighbor search technology, and the lessons learned, from MongoDB and PostgreSQL through Elasticsearch at Artsy.net. --------------------------------------------------------------------------------","Daniel Doubrovkine, CEO of Artsy.net and 2016 Ruby prize award nominee, took the stage. Daniel presented Artsy.net's Art Genome Project, a classification system and technological framework that powers Artsy.",DataLayer Conference: Storage Wars - The Art Genome Project,Live,199 557,"
AI REVOLUTIONIZES INDUSTRIES, NOT WORLD DOMINATION By John N - November 10, 2016 (Image: Tatiana Shepeleva | Shutterstock.com) Speaking at a programmer's conference this past summer, Bill Gates referred to AI as the “Holy Grail of computer science research”, illustrating how scientists from all fields of study are working to create intelligent machines as a tool for humanity. However, fears over creating smart machines that will enslave humanity persist. Is it possible to change the popular view of technology as inevitably evil by encouraging a more profound understanding of how AI works and its potential to assist us in our daily lives? AI seems to have an ever-increasing influence on how our world works, from Facebook's facial recognition software for image tagging to spellcheck. Other big technology companies like Google also use AI extensively, and the expectation is that AI will revolutionize industries and will continue to supplement work that humans no longer (have to) do themselves. However, because automation is beginning to eliminate thousands of human jobs each year and investments in AI research and start-ups are exploding, many fear the rise of Matrix-style machines. AI REVOLUTIONIZES INDUSTRIES Traditional computers, while powerful, lack the capacity to be self-aware or make independent decisions.
AI, in contrast, can make decisions on its own and even adapt to new rules without being prompted to do so. Therefore, AI already demonstrates some ability to understand the environment and become self-aware. With this realization, the assumption is that the machine will be more efficient at solving complex problems and doing large volumes of calculations as quickly as possible. For example, Google uses AI to deliver better search engine results and even experiment with self-driving cars. In addition, Facebook also uses AI to optimize customer experience on its social media sites. Amazon, for its part, uses intelligent robots in its warehouses to collect items for packaging. In the manufacturing industry, AI-driven machines can do everything from coordinating whole production lines to carrying out the smallest and most menial tasks. The evidence of broad adoption of AI is there, but for the average person, AI might appear a distant and abstract concept. However, we engage AI every time we ask Google to help us find a good fried chicken joint. ROBOT OVERLORD Despite the benefits of AI, it continues to be treated with suspicion due, in part, to Hollywood depictions of hyper-aggressive intelligent robots harvesting our bodies for energy. In the Terminator movies, for example, robot assassins are sent by Skynet, an AI defense network that seeks to exterminate the human race. Even when a company's goal is to use AI to improve the quality of human life, it must account for consumer suspicion. Distrust and anxiety make it harder to garner interest in many AI-powered technologies, perhaps because fear is often a product of a lack of understanding or a lack of information. “We won't stop needing technology until we run out of problems” – Tim O'Reilly. COULD SKYNET BECOME A REALITY? Technologies (like AI) are a tool, and tools are neither inherently good nor bad. Instead, their merit depends on how we use them. AI is a tool that, so far, seems to be making life easier and safer. For instance, the adoption of self-driving cars has the potential to save thousands of lives lost every year from traffic accidents – again depending on how the technology is used. Furthermore, AI can save the lives of patients through better, earlier diagnosis brought by seamless access to medical data. These AI applications are limited and highly specialized, however. Until someone desires to build a machine hellbent on world domination, it is unlikely that an AI would choose that path. According to Tim O'Reilly via Data Center Frontier, “we won't stop needing technology until we run out of problems”. As people continue to innovate and complicate their personal and professional lives, more simple tasks are being automated behind the scenes. As long as machines can complete these tasks for us, their use is inevitable. For now, it seems that Hollywood fears won't stop AI research and development. At the end of the day, understanding is our most effective weapon against the fear that accompanies ignorance. Source: Data Center Frontier
","From autocorrect to Google Maps, AI has already started picking up some of the slack. AI revolutionizes industries and our daily lives.","AI Revolutionizes Industries, not World Domination",Live,200 558,"DATALAYER CONFERENCE: KEYNOTE WITH MITCH PIRTLE, CAPITALONE Published Oct 25, 2016. This week we're introducing our first talk from DataLayer Conf, the Keynote with Mitch Pirtle from CapitalOne. If you've been in the open source world for a while, chances are you've seen Mitch around the Joomla or Postgres community, among countless others. So what's DataLayer? DataLayer is a Compose-sponsored conference that we held last month. It was great. We had speakers from CapitalOne, GitHub, Artsy, Meteor, Princeton, ZenDesk and more. Over the next several weeks, we're going to share the videos of the presentations from the conference so those of you who were unable to attend can still benefit. In this Keynote, Mitch discusses the current state of the data layer for enterprise, focusing on the polyglot experience. We are, according to Mitch, in a world of constantly changing and evolving technology stacks; and this means an ever-changing roster of languages and platforms that access data (including even the very ways in which we store and access data from these apps). Where do databases fit in this rapidly expanding picture of 'one tool, one task', especially at the scale of petabytes and zettabytes? -------------------------------------------------------------------------------- If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com. We're happy to hear from you.
","This week we're introducing our first talk from DataLayer Conf, the Keynote with Mitch Pirtle from CapitalOne.","DataLayer Conference: Keynote with Mitch Pirtle, CapitalOne",Live,201 559,"Margriet Groenendijk, Developer Advocate | IBM Watson Data Platform | Data Science | Climate and Weather | Geography. Aug 29, 2016 -------------------------------------------------------------------------------- ANALYZE OPEN DATA SETS USING PANDAS IN A PYTHON NOTEBOOK Open data is freely available, which means you can modify, store, and use it without any restrictions. Governments, academic institutions, and publicly focused agencies are the most common providers of open data. They typically share things like environmental, economic, census, and health data sets. You can learn more about open data from The Open Data Institute or from Wikipedia. Two great places to start browsing are data.gov and data.gov.uk, where you can find all sorts of data sets. Other good sources are the World Bank, the FAO, Eurostat and the Bureau of Labor Statistics. If you're interested in a specific country or region, just do a quick Google search, and you'll likely uncover other sources as well. Open data can be a powerful analysis tool, especially when you connect multiple data sets to derive new insights. This tutorial features a notebook that helps you get started with analysis using pandas. Pandas is one of my favorite data analysis packages. It's very flexible and includes tools that make it easy to load, index, classify, and group data. In this tutorial, you will learn how to work with a DataFrame in 2 basic steps: 1. Load data from open data sets into a Python notebook in Data Science Experience. 2. Work with a Python notebook on Data Science Experience (join data frames, clean, check, and analyze the data using simple statistical tools). DATA & ANALYTICS ON DATA SCIENCE EXPERIENCE Data Science Experience features a selection of open data sets that you can download and use any way you want. It's easy to get an account, start a notebook, and grab some data: 1. Sign in to Data Science Experience (or sign up for a free trial). 2. Open the sample notebook called Analyze open data sets with pandas DataFrames. To open the sample notebook, click here (or type its name in the Search field on the home page of Data Science Experience and select the card for the notebook), then click the button on the top of the preview page that opens. Select a project and Spark service and click Create Notebook. The sample notebook opens for you to work with. 3. Find the first data set and get its access key URL. 4. From the Data Science Experience home page, search for “life expectancy”. 5.
Click the card with the title Life expectancy at birth by country in total years . 6. Click the Manage Access Keys button. 7. Click Request a New Access Key . 8. Copy the access key URL, and click Close . You’ll use this link in a minute to load data into the Python notebook. Tip: If you don’t want to run the commands yourself, you can also just open the notebook in your browser and follow along: https://apsportal.ibm.com/exchange/public/entry/view/47ed96c50374ccd15f93ef262c1af63bLOAD DATA INTO A DATAFRAME Paste the access key URL you copied from the Life Expectancy data set into the following code (replacing the string). Then run the following code to load the data in a data frame. This code keeps 3 columns and renames them. import pandas as pd import numpy as np # life expectancy at birth in years life = pd.read_csv("""",usecols=['Country or Area','Year','Value']) life.columns = ['country','year','life'] life.head() Life expectancy figures might be more meaningful if we combine them with other open data sets from Data Science Experience. Let’s start by loading the data set Total Population by country. To do so, find the data set on the DSX home page, request an access key for it, and replace with your access key URL in the following code. Then run the code. # population population = pd.read_csv("""",usecols=['Country or Area', 'Year','Value']) population.columns = ['country', 'year','population'] print ""Nr of countries in life:"", np.size(np.unique(life['country'])) print ""Nr of countries in population:"", np.size(np.unique(population['country'])) Nr of countries in life: 246 Nr of countries in population: 277 JOINING DATA FRAMES These two data sets don’t fit together perfectly. For instance, one lists more countries than the other. When we join the two data frames we’re sure to introduce nulls or NaNs into the new data frame. We’ll use the pandas merge function to handle this problem. This function includes many options . In the following code, how='outer' makes sure we keep all data from life and population . on=['country','year'] specifies which columns to perform the merge on. df = pd.merge(life, population, how='outer', sort=True, on=['country','year']) df[400:405] We can add more data to the data frame in a similar way. 
For each data set in the following list, find the data set on the DSX home page, request an access key URL, and copy the the URL into the code (again replacing the string with the corresponding access key URL): * Population below national poverty line, total, percentage * Primary school completion rate % of relevant age group by country * Total employment, by economic activity (Thousands) * Births attended by skilled health staff (% of total) by country * Measles immunization % children 12–23 months by country # poverty (%) poverty = pd.read_csv("""",usecols=['Country or Area', 'Year','Value']) poverty.columns = ['country', 'year','poverty'] df = pd.merge(df, poverty, how='outer', sort=True, on=['country','year']) # school completion (%) school = pd.read_csv("""",usecols=['Country or Area', 'Year','Value']) school.columns = ['country', 'year','school'] df = pd.merge(df, school, how='outer', sort=True, on=['country','year']) # employment employmentin = pd.read_csv("""",usecols=['Country or Area','Year','Value','Sex','Subclassification']) employment = employmentin.loc[(employmentin.Sex=='Total men and women') & (employmentin.Subclassification=='Total.')] employment = employment.drop('Sex', 1) employment = employment.drop('Subclassification', 1) employment.columns = ['country', 'year','employment'] df = pd.merge(df, employment, how='outer', sort=True, on=['country','year']) # births attended by skilled staff (%) births = pd.read_csv("""",usecols=['Country or Area', 'Year','Value']) births.columns = ['country', 'year','births'] df = pd.merge(df, births, how='outer', sort=True, on=['country','year']) # measles immunization (%) measles = pd.read_csv("""",usecols=['Country or Area', 'Year','Value']) measles.columns = ['country', 'year','measles'] df = pd.merge(df, measles, how='outer', sort=True, on=['country','year']) df.head() The resulting table looks kind of strange, as it contains incorrect values, like numbers in the country column and text in the year column. You can manually remove these errors from the data frame. Also, we can now create a multi-index with country and year. df2=df.drop(df.index[0:40]) df2 = df2.set_index(['country','year']) df2.head(10) If you are curious about other variables, you can keep adding data sets from Data Science Experience to this data frame. Be aware that not all data is equally formatted and might need some clean-up before you add it. Use the code samples you just read about, and make sure you keep checking results with a quick look at each of your tables when you load or change them with commands like df2.head() . CHECK THE DATA You can run a first check of the data with describe() , which calculates some basic statistics for each of the columns in the dataframe. It gives you the number of values (count), the mean , the standard deviation (std), the min and max, and some percentiles . df2.describe() DATA ANALYSIS At this point, we have enough sample data to work with. Let’s start by finding the correlation between different variables. First we’ll create a scatter plot, and relate the values for two variables of each row. In our code, we also customize the look by defining the font and figure size and colors of the points with matplotlib. 
import matplotlib.pyplot as plt %matplotlib inline plt.rcParams['font.size']=11 plt.rcParams['figure.figsize']=[8.0, 3.5] fig, axes=plt.subplots(nrows=1, ncols=2) df2.plot(kind='scatter', x='life', y='population', ax=axes[0], color='Blue') df2.plot(kind='scatter', x='life', y='school', ax=axes[1], color='Red') plt.tight_layout() The figure on the left shows that increased life expectancy leads to higher population. The figure on the right shows that the life expectancy increases with the percentage of school completion. But the percentage ranges from 0 to 200, which is odd for a percentage. You can remove the outliers by setting values outside a specified range to NaN, for example df2[df2.school>100]=float('NaN') . Even better would be to check where these values in the original data came from. In some cases, a range like this could indicate an error in your code somewhere. In this case, the values are correct; see the description of the school completion data. We don’t have data for all the exact same years, so we’ll group by country (be aware that we lose some information by doing so). Also, because the variables are percentages, we’ll convert our employment figures to percent. We probably no longer need the population column, so let's drop it. Then we create scatter plots from the data frame using scatter_matrix , which creates plots for all variables and also adds a histogram for each. from pandas.tools.plotting import scatter_matrix # group by country grouped = df2.groupby(level=0) dfgroup = grouped.mean() # employment in % of total population dfgroup['employment']=(dfgroup['employment']*1000.)/dfgroup['population']*100 dfgroup=dfgroup.drop('population',1) scatter_matrix(dfgroup,figsize=(12, 12), diagonal='kde') You can see that the data is now in a pretty good state. There are no large outliers. We can even start to see some relationships: life expectancy increases with schooling, employment, safe births, and measles vaccination. You are deriving insights from the data and can now build a statistical model — for instance, have a look at an ordinary least squares regression ( OLS ) from StatsModels . SUMMARY In this tutorial, you learned how to use open data from Data Science Experience in a Python notebook. You saw how to load, clean and explore data using pandas. As you can see from this example, data analysis entails lots of trial and error. This experimentation can be challenging, but is also a lot of fun! -------------------------------------------------------------------------------- Originally published at datascience.ibm.com on August 30, 2016.","Open data is freely available, which means you can modify, store, and use it without any restrictions. Governments, academic institutions, and publicly focused agencies are the most common providers…",Analyze open data sets using pandas in a Python notebook,Live,202 563,"Glynn Bird Developer Advocate @ IBM Watson Data Platform. Views are my own etc.
Aug 21 -------------------------------------------------------------------------------- SERVERLESS AUTOCOMPLETE WHICH WAY OF DEPLOYING AN AUTOCOMPLETE SERVICE IS RIGHT FOR YOU? Autocomplete is everywhere. As you type into a web form, the page offers you completions that match. An autocomplete service consists of three components: 1. The data, a list of strings defining the accepted values for a web form. 2. The code that navigates the list, comparing the typed input with accepted completions. 3. The front end that renders the matches under the web form, allowing an answer to be picked. The three basic components of any autocomplete service: data, code, and a front end.Now, I’ll consider three ways you might deploy autocomplete in practice. CLIENT-SIDE AUTOCOMPLETE When the data size is small (say, less than 100 options), then it makes sense to bundle the data and the code into the front end itself. When the web page is loaded, the entire list of options arrives at the web browser and the autocomplete logic executes locally. Client-side autocomplete: bundle it all in the browser.This approach gives you the fastest performance, but it’s only suitable for small data sets. CLIENT-SERVER AUTOCOMPLETE For larger data sizes, it becomes impractical to have the web page download the entire data set. Instead, the web page makes an HTTP call to a server-side process which queries a database and returns the matching answers: Client-server autocomplete: a server-side process and a database server work in tandem to get the browser its data.For this process to be quick enough to respond as a user types, the server needs to be geographically close to the client and connected to a fast database — typically an in-memory store like Redis . I wrote last year about a Simple Autocomplete Service that creates multiple autocomplete API microservices for you using Bluemix and Redis. But there is a third way. SERVERLESS AUTOCOMPLETE Instead of deploying server-side code that runs 24x7 waiting for autocomplete requests to arrive, “serverless” platforms like Apache OpenWhisk™ allow your micoservices to be deployed on a pay-as-you-go basis. The more computing capacity you use, the more you pay: from zero to lots. With this minimalist approach, your autocomplete service can bundle both the data and the code into an OpenWhisk “action”, so you don’t need to have a separate database: Serverless autocomplete: The browser’s request triggers a new serverless function with bundled data. The code runs on-demand, with no servers to maintain.Bundling the data and the code in the same serverless package makes for faster performance with fewer moving parts and automatic scaling. BUILDING A SERVERLESS AUTOCOMPLETE SERVICE Removing the database from your application architecture means you’ll have to implement the indexing and lookup functions yourself. To reduce the repetition, I’ve built a utility that builds an OpenWhisk action for you. First, you’ll need Node.js and npm installed on your machine, together with the OpenWhisk wsk utility paired with your IBM Bluemix account. Then simply install the serverless-autocomplete package: npm install -g serverless-autocomplete Take a text file of strings that you want to use and run acsetup with the path of the file: acsetup names.txt The acsetup command configures an autocomplete service for you, and provides usage examples.The acsetup utility indexes your data, bundles it with an OpenWhisk autocomplete algorithm written in Node.js, and sends it to your OpenWhisk account. 
Here's what it returns: * The URL of your service. * An example curl statement. * An HTML snippet that you can use in your own web page — simply save it as an HTML file and open it in your web browser. You can create as many autocomplete services as you like: acsetup uspresidents.txt acsetup soccerplayers.txt acsetup gameofthrones.txt Each service will have its own URL and is ready to use immediately. If you need to change the data, simply update the text file and re-run acsetup . Happy searching!","Autocomplete is everywhere. As you type into a web form, the page offers you completions that match. An autocomplete service consists of three components: When the data size is small (say, less than…",Serverless Autocomplete – IBM Watson Data Lab – Medium,Live,203 565,"DATALAYER EXPOSED: CHARITY MAJORS ON OBSERVABILITY & THE GLORIOUS FUTURE Published Jun 5, 2017 We're bringing you video of all the sessions from this year's DataLayer conference, starting with the opening keynote from Charity Majors on Observability. Dive in now and start your own virtual DataLayer. Last month, Compose descended on Austin to host the second annual DataLayer Conference, a conference devoted to the space where apps meet data. The talks were unbelievable so we decided they needed to be shared with the world, not just those who were able to join us in Texas. We kicked the conference off with our morning keynote address with Charity Majors. Charity gave us a look at the complexity of today's infrastructure, from distributed systems and microservices, to automation and orchestration, to containers, schedulers, and persistence layers, and discussed what the next generation of observability needs to look like. Moving forward, it's going to be important to engineer your systems to be understandable, explorable, and self-explanatory. To do that, you're going to need tooling that manages that complexity, and some of the ways things have traditionally been done are just going to have to go. It was a great start to the day and she gave those of us in the audience quite a bit to think about. Watch her talk and let us know what you think using the hashtag #DataLayerConf. Be sure to check back every Monday for the next installment of DataLayer. -------------------------------------------------------------------------------- We're in the planning stages for DataLayer 2018 right now so, if you have an idea for a talk, start fleshing that out. We'll have a CFP, followed by a blind submission review, and then select our speakers, who we'll fly to DataLayer to present. Sounds fun, right? Thom Crowe is a marketing and community guy at Compose, who enjoys long walks on the beach, reading, spending time with his wife and daughter and tinkering. Love this article?
Head over to Thom Crowe ’s author page and keep reading.RELATED ARTICLES May 15, 2017DATALAYER DESCENDS ON AUSTIN After months of planning, it's finally here: DataLayer Conference. On Wednesday, we're hosting our second annual conference f… Thom Crowe Apr 13, 2017GETTING THE BEST CONFERENCE SPEAKERS WITH BLIND SUBMISSIONS When we decided to launch our conference last year, we knew we wanted the best speakers and topics. Here's how we ensured we… Thom Crowe Mar 23, 2017ANNOUNCING DATALAYER CONF 2017'S ALL-STAR LINEUP DataLayer Conf 2017 is coming to Austin, Texas on May 17th to the Alamo Draft House on Lamar and we couldn't be more excited.… Thom Crowe Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Customer Stories Compose Webinars Support System Status Support & Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ ScyllaDB MySQL Deployments AWS SoftLayer Google Cloud Services Enterprise Add-ons © 2017 Compose, an IBM Company","We're bringing you video of all the sessions from this year's DataLayer conference, starting with the opening keynote from Charity Majors on Observability.",Charity Majors on Observability & The Glorious Future,Live,204 568,"ERIC JANG Technology, A.I., Careers SUNDAY, AUGUST 7, 2016 A BEGINNER'S GUIDE TO VARIATIONAL METHODS: MEAN-FIELD APPROXIMATION Variational Bayeisan (VB) Methods are a family of techniques that are very popular in statistical Machine Learning. VB methods allow us to re-write statistical inference problems (i.e. infer the value of a random variable given the value of another random variable) as optimization problems (i.e. find the parameter values that minimize some objective function). This inference-optimization duality is powerful because it allows us to use the latest-and-greatest optimization algorithms to solve statistical Machine Learning problems (and vice versa, minimize functions using statistical techniques). This post is an introductory tutorial on Variational Methods. I will derive the optimization objective for the simplest of VB methods, known as the Mean-Field Approximation. This objective, also known as the Variational Lower Bound , is exactly the same one used in Variational Autoencoders (a neat paper which I will explain in a follow-up post). TABLE OF CONTENTS 1. Preliminaries and Notation 2. Problem formulation 3. Variational Lower Bound for Mean-field Approximation 4. Forward KL vs. Reverse KL 5. Connections to Deep Learning PRELIMINARIES AND NOTATION This article assumes that the reader is familiar with concepts like random variables, probability distributions, and expectations. Here's a refresher if you forgot some stuff. Machine Learning & Statistics notation isn't standardized very well, so it's helpful to be really precise with notation in this post: * Uppercase $X$ denotes a random variable * Uppercase $P(X)$ denotes the probability distribution over that variable * Lowercase $x \sim P(X)$ denotes a value $x$ sampled ($\sim$) from the probability distribution $P(X)$ via some generative process. * Lowercase $p(X)$ is the density function of the distribution of $X$. It is a scalar function over the measure space of $X$. * $p(X=x)$ (shorthand $p(x)$) denotes the density function evaluated at a particular value $x$. Many academic papers use the terms ""variables"", ""distributions"", ""densities"", and even ""models"" interchangeably. 
This is not necessarily wrong per se, since $X$, $P(X)$, and $p(X)$ all imply each other via a one-to-one correspondence. However, it's confusing to mix these words together because their types are different (it doesn't make sense to sample a function, nor does it make sense to integrate a distribution). We model systems as a collection of random variables, where some variables ($X$) are ""observable"", while other variables ($Z$) are ""hidden"". We can draw this relationship via the following graph: The edge drawn from $Z$ to $X$ relates the two variables together via the conditional distribution $P(X|Z)$. Here's a more concrete example: $X$ might represent the ""raw pixel values of an image"", while $Z$ is a binary variable such that $Z=1$ ""if $X$ is an image of a cat"". $X = $ $P(Z=1) = 1$ (definitely a cat) $X= $ $P(Z=1) = 0$ (definitely not a cat) $X = $ $P(Z=1) = 0.1$ (sort of cat-like) Bayes' Theorem gives us a general relationship between any pair of random variables: $$p(Z|X) = \frac{p(X|Z)p(Z)}{p(X)}$$ The various pieces of this are associated with common names: $p(Z|X)$ is the posterior probability : ""given the image, what is the probability that this is of a cat?"" If we can sample from $z \sim P(Z|X)$, we can use this to make a cat classifier that tells us whether a given image is a cat or not. $p(X|Z)$ is the likelihood : ""given a value of $Z$ this computes how ""probable"" this image $X$ is under that category ({""is-a-cat"" / ""is-not-a-cat""}). If we can sample from $x \sim P(X|Z)$, then we generate images of cats and images of non-cats just as easily as we can generate random numbers. If you'd like to learn more about this, see my other articles on generative models: [1] , [2] . $p(Z)$ is the prior probability . This captures any prior information we know about $Z$ - for example, if we think that 1/3 of all images in existence are of cats, then $p(Z=1) = \frac{1}{3}$ and $p(Z=0) = \frac{2}{3}$. HIDDEN VARIABLES AS PRIORS This is an aside for interested readers. Skip to the next section to continue with the tutorial. The previous cat example presents a very conventional example of observed variables, hidden variables, and priors. However, it's important to realize that the distinction between hidden / observed variables is somewhat arbitrary, and you're free to factor the graphical model however you like. We can re-write Bayes' Theorem by swapping the terms: $$\frac{p(Z|X)p(X)}{p(Z)} = p(X|Z)$$ The ""posterior"" in question is now $P(X|Z)$. Hidden variables can be interpreted from a Bayesian Statistics framework as prior beliefs attached to the observed variables. For example, if we believe $X$ is a multivariate Gaussian, the hidden variable $Z$ might represent the mean and variance of the Gaussian distribution. The distribution over parameters $P(Z)$ is then a prior distribution to $P(X)$. You are also free to choose which values $X$ and $Z$ represent. For example, $Z$ could instead be ""mean, cube root of variance, and $X+Y$ where $Y \sim \mathcal{N}(0,1)$"". This is somewhat unnatural and weird, but the structure is still valid, as long as $P(X|Z)$ is modified accordingly. You can even ""add"" variables to your system. The prior itself might be dependent on other random variables via $P(Z|\theta)$, which have prior distributions of their own $P(\theta)$, and those have priors still, and so on. Any hyper-parameter can be thought of as a prior. In Bayesian statistics, it's priors all the way down . 
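Before moving on to the problem formulation, here is Bayes' Theorem from the cat example evaluated numerically, using the prior $p(Z=1)=\frac{1}{3}$ stated above; the likelihood values 0.8 and 0.1 for a particular image $x$ are assumed purely for illustration:
$$p(Z=1|X=x) = \frac{p(x|Z=1)\,p(Z=1)}{p(x|Z=1)\,p(Z=1) + p(x|Z=0)\,p(Z=0)} = \frac{0.8 \cdot \frac{1}{3}}{0.8 \cdot \frac{1}{3} + 0.1 \cdot \frac{2}{3}} = 0.8$$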
PROBLEM FORMULATION The key problem we are interested in is posterior inference , or computing functions on the hidden variable $Z$. Some canonical examples of posterior inference: * Given this surveillance footage $X$, did the suspect show up in it? * Given this twitter feed $X$, is the author depressed? * Given historical stock prices $X_{1:t-1}$, what will $X_t$ be? We usually assume that we know how to compute functions on likelihood function $P(X|Z)$ and priors $P(Z)$. The problem is, for complicated tasks like above, we often don't know how to sample from $P(Z|X)$ or compute $p(X|Z)$. Alternatively, we might know the form of $p(Z|X)$, but the corresponding computation is so complicated that we cannot evaluate it in a reasonable amount of time. We could try to use sampling-based approaches like MCMC , but these are slow to converge. VARIATIONAL LOWER BOUND FOR MEAN-FIELD APPROXIMATION The idea behind variational inference is this: let's just perform inference on an easy, parametric distribution $Q_\phi(Z|X)$ (like a Gaussian) for which we know how to do posterior inference, but adjust the parameters $\phi$ so that $Q_\phi$ is as close to $P$ as possible. This is visually illustrated below: the blue curve is the true posterior distribution, and the green distribution is the variational approximation (Gaussian) that we fit to the blue density via optimization. What does it mean for distributions to be ""close""? Mean-field variational Bayes (the most common type) uses the Reverse KL Divergence to as the distance metric between two distributions. $$KL(Q_\phi(Z|X)||P(Z|X)) = \sum_{z \in Z}{q_\phi(z|x)\log\frac{q_\phi(z|x)}{p(z|x)}}$$ Reverse KL divergence measures the amount of information (in nats, or units of $\frac{1}{\log(2)}$ bits) required to ""distort"" $P(Z)$ into $Q_\phi(Z)$. We wish to minimize this quantity with respect to $\phi$. By definition of a conditional distribution, $p(z|x) = \frac{p(x,z)}{p(x)}$. Let's substitute this expression into our original $KL$ expression, and then distribute: $$ \begin{align} KL(Q||P) & = \sum_{z \in Z}{q_\phi(z|x)\log\frac{q_\phi(z|x)p(x)}{p(z,x)}} && \text{(1)} \\ & = \sum_{z \in Z}{q_\phi(z|x)\big(\log{\frac{q_\phi(z|x)}{p(z,x)}} + \log{p(x)}\big)} \\ & = \Big(\sum_{z}{q_\phi(z|x)\log{\frac{q_\phi(z|x)}{p(z,x)}}}\Big) + \Big(\sum_{z}{\log{p(x)}q_\phi(z|x)}\Big) \\ & = \Big(\sum_{z}{q_\phi(z|x)\log{\frac{q_\phi(z|x)}{p(z,x)}}}\Big) + \Big(\log{p(x)}\sum_{z}{q_\phi(z|x)}\Big) && \text{note: $\sum_{z}{q(z)} = 1 $} \\ & = \log{p(x)} + \Big(\sum_{z}{q_\phi(z|x)\log{\frac{q_\phi(z|x)}{p(z,x)}}}\Big) \\ \end{align} $$ To minimize $KL(Q||P)$ with respect to variational parameters $\phi$, we just have to minimize $\sum_{z}{q_\phi(z|x)\log{\frac{q_\phi(z|x)}{p(z,x)}}}$, since $\log{p(x)}$ is fixed with respect to $\phi$. Let's re-write this quantity as an expectation over the distribution $Q_\phi(Z|X)$. 
$$ \begin{align} \sum_{z}{q_\phi(z|x)\log{\frac{q_\phi(z|x)}{p(z,x)}}} & = \mathbb{E}_{z \sim Q_\phi(Z|X)}\big[\log{\frac{q_\phi(z|x)}{p(z,x)}}\big]\\ & = \mathbb{E}_Q\big[ \log{q_\phi(z|x)} - \log{p(x,z)} \big] \\ & = \mathbb{E}_Q\big[ \log{q_\phi(z|x)} - (\log{p(x|z)} + \log(p(z))) \big] && \text{(via $\log{p(x,z)=p(x|z)p(z)}$) }\\ & = \mathbb{E}_Q\big[ \log{q_\phi(z|x)} - \log{p(x|z)} - \log(p(z))) \big] \\ \end{align} \\ $$ Minimizing this is equivalent to maximizing the negation of this function: $$ \begin{align} \text{maximize } \mathcal{L} & = -\sum_{z}{q_\phi(z|x)\log{\frac{q_\phi(z|x)}{p(z,x)}}} \\ & = \mathbb{E}_Q\big[ -\log{q_\phi(z|x)} + \log{p(x|z)} + \log(p(z))) \big] \\ & = \mathbb{E}_Q\big[ \log{p(x|z)} + \log{\frac{p(z)}{ q_\phi(z|x)}} \big] && \text{(2)} \\ \end{align} $$ In literature, $\mathcal{L}$ is known as the variational lower bound , and is computationally tractable if we can evaluate $p(x|z), p(z), q(z|x)$. We can further re-arrange terms in a way that yields an intuitive formula: $$ \begin{align*} \mathcal{L} & = \mathbb{E}_Q\big[ \log{p(x|z)} + \log{\frac{p(z)}{ q_\phi(z|x)}} \big] \\ & = \mathbb{E}_Q\big[ \log{p(x|z)} \big] + \sum_{Q}{q(z|x)\log{\frac{p(z)}{ q_\phi(z|x)}}} && \text{Definition of expectation} \\ & = \mathbb{E}_Q\big[ \log{p(x|z)} \big] - KL(Q(Z|X)||P(Z)) && \text{Definition of KL divergence} && \text{(3)} \end{align*} $$ If sampling $z \sim Q(Z|X)$ is an ""encoding"" process that converts an observation $x$ to latent code $z$, then sampling $x \sim Q(X|Z)$ is a ""decoding"" process that reconstructs the observation from $z$. It follows that $\mathcal{L}$ is the sum of the expected ""decoding"" likelihood (how good our variational distribution can decode a sample of $Z$ back to a sample of $X$), plus the KL divergence between the variational approximation and the prior on $Z$. If we assume $Q(Z|X)$ is conditionally Gaussian, then prior $Z$ is often chosen to be a diagonal Gaussian distribution with mean 0 and standard deviation 1. Why is $\mathcal{L}$ called the variational lower bound? Substituting $\mathcal{L}$ back into Eq. (1), we have: $$ \begin{align*} KL(Q||P) & = \log p(x) - \mathcal{L} \\ \log p(x) & = \mathcal{L} + KL(Q||P) && \text{(4)} \end{align*} $$ The meaning of Eq. (4), in plain language, is that $p(x)$, the log-likelihood of a data point $x$ under the true distribution, is $\mathcal{L}$, plus an error term $KL(Q||P)$ that captures the distance between $Q(Z|X=x)$ and $P(Z|X=x)$ at that particular value of $X$. Since $KL(Q||P) \geq 0$, $\log p(x)$ must be greater than $\mathcal{L}$. Therefore $\mathcal{L}$ is a lower bound for $\log p(x)$. $\mathcal{L}$ is also referred to as evidence lower bound (ELBO), via the alternate formulation: $$ \mathcal{L} = \log p(x) - KL(Q(Z|X)||P(Z|X)) = \mathbb{E}_Q\big[ \log{p(x|z)} \big] - KL(Q(Z|X)||P(Z)) $$ Note that $\mathcal{L}$ itself contains a KL divergence term between the approximate posterior and the prior, so there are two KL terms in total in $\log p(x)$. FORWARD KL VS. REVERSE KL KL divergence is not a symmetric distance function, i.e. $KL(P||Q) \neq KL(Q||P)$ (except when $Q \equiv P$) The first is known as the ""forward KL"", while the latter is ""reverse KL"". So why do we use Reverse KL? This is because the resulting derivation would require us to know how to compute $p(Z|X)$, which is what we'd like to do in the first place. I really like Kevin Murphy's explanation in the PML textbook , which I shall attempt to re-phrase here: Let's consider the forward-KL first. 
As we saw from the above derivations, we can write KL as the expectation of a ""penalty"" function $\log \frac{p(z)}{q(z)}$ over a weighing function $p(z)$. $$ \begin{align*} KL(P||Q) & = \sum_z p(z) \log \frac{p(z)}{q(z)} \\ & = \mathbb{E}_{p(z)}{\big[\log \frac{p(z)}{q(z)}\big]}\\ \end{align*} $$ The penalty function contributes loss to the total KL wherever $p(Z) > 0$. For $p(Z) > 0$, $\lim_{q(Z) \to 0} \log \frac{p(z)}{q(z)} \to \infty$. This means that the forward-KL will be large wherever $Q(Z)$ fails to ""cover up"" $P(Z)$. Therefore, the forward-KL is minimized when we ensure that $q(z) > 0$ wherever $p(z)> 0$. The optimized variational distribution $Q(Z)$ is known as ""zero-avoiding"" (density avoids zero when $p(Z)$ is zero). Minimizing the Reverse-KL has exactly the opposite behavior: $$ \begin{align*} KL(Q||P) & = \sum_z q(z) \log \frac{q(z)}{p(z)} \\ & = \mathbb{E}_{p(z)}{\big[\log \frac{q(z)}{p(z)}\big]} \end{align*} $$ If $p(Z) = 0$, we must ensure that the weighting function $q(Z) = 0$ wherever denominator $p(Z) = 0$, otherwise the KL blows up. This is known as ""zero-forcing"": So in summary, minimizing forward-KL ""stretches"" your variational distribution $Q(Z)$ to cover over the entire $P(Z)$ like a tarp, while minimizing reverse-KL ""squeezes"" the $Q(Z)$ under $P(Z)$. It's important to keep in mind the implications of using reverse-KL when using the mean-field approximation in machine learning problems. If we are fitting a unimodal distribution to a multi-modal one, we'll end up with more false negatives (there is actually probability mass in $P(Z)$ where we think there is none in $Q(Z)$). CONNECTIONS TO DEEP LEARNING Variational methods are really important for Deep Learning. I will elaborate more in a later post, but here's a quick spoiler: 1. Deep learning is really good at optimization (specifically, gradient descent) over very large parameter spaces using lots of data. 2. Variational Bayes give us a framework with which we can re-write statistical inference problems as optimization problems. Combining Deep learning and VB Methods allow us to perform inference on extremely complex posterior distributions. As it turns out, modern techniques like Variational Autoencoders optimize the exact same mean-field variational lower-bound derived in this post! Thanks for reading, and stay tuned! Posted by Eric at 11:50 PM Email This BlogThis! Share to Twitter Share to Facebook Share to Pinterest Labels: AI , Statistics17 COMMENTS: 1. Incognito August 8, 2016 at 11:34 AMThere should be a minus in equation (3) for E[log p(x|z)] i.e. E[ -log p(x|z)] otherwise your definition of KL-divergence isn't consistent. Ankur. Reply Delete Replies 1. Eric August 8, 2016 at 2:15 PMThanks for your sharp eyes! I added the minus in front of the KL term. Delete angusturner27 June 4, 2017 at 3:42 AMDo you mind explaining where that negative comes from? I was anticipating a plus... Delete 2. Reply 2. Vladislavs Dovgalecs August 8, 2016 at 11:58 PMThanks for the great post, Eric! Do you plan (or have a link to) to write a simple tutorial to illustrate the VB in practice? Reply Delete 3. John Barness August 10, 2016 at 5:42 AMThe post is worth reading. Reply Delete 4. Fahim Lee August 11, 2016 at 5:04 AMThis comment has been removed by a blog administrator. Reply Delete 5. Emery Goossens August 12, 2016 at 11:06 AMThis tutorial is fantastic! I believe the phrase ""must be strictly greater than"" should omit ""strictly"" seeing as equality could hold according to your definition. 
Reply Delete Replies 1. Eric August 14, 2016 at 7:12 PMThat's correct! Thank you :) Delete 2. Reply 6. David N. Olson August 15, 2016 at 12:07 AMThis comment has been removed by a blog administrator. Reply Delete 7. skim October 18, 2016 at 1:01 PMGiven the title of your post, it's worth giving some motivation behind the name ""mean-field approximation"". From a statistical physics point of view, ""mean-field"" refers to the relaxation of a difficult optimization problem to a simpler one which ignores second-order effects. For example, in the context of graphical models, one can approximate the partition function of a Markov random field via maximization of the Gibbs free energy (i.e., log partition function minus relative entropy) over the set of product measures, which is significantly more tractable than global optimization over the space of all probability measures (see, e.g., M. Mezard and A. Montanari, Sect 4.4.2). From an algorithmic point of view, ""mean-field"" refers to the naive mean field algorithm for computing marginals of a Markov random field. Recall that the fixed points of the naive mean field algorithm are optimizers of the mean-field approximation to the Gibbs variational problem. This approach is ""mean"" in that it is the average/expectation/LLN version of the Gibbs sampler, hence ignoring second-order (stochastic) effects (see, e.g., M. Wainwright and M. Jordan, (2.14) and (2.15)). Reply Delete Replies 1. Eric November 6, 2016 at 10:47 PMI didn't know that! Thank you for sharing this. I hope that interested readers will scroll down and find your comment. Delete 2. Reply 8. Aafiya Designer February 24, 2017 at 2:42 PMOn the off chance that you have not exploited surveillance cameras to ensure your property, please consider to begin utilizing them. best home surveillance system Reply Delete 9. sutony April 25, 2017 at 10:09 PMI read a few blogs/articles/slides about variational autoencoders, and I personally think this is the best one. The key ideas are pointed out clearly. The technical terms(e.g., ELBO) are well explained, too. Thanks so much. Reply Delete 10. Magdiel Jiménez Guarneros May 3, 2017 at 4:04 PMHi, can you explain me the relation of the sum over q(z) equal to 1 in equation (1)?. Thanks, I don't catch it. Reply Delete Replies 1. SunFish7 May 7, 2017 at 2:52 AMProbabilities sum to 1. i.e. Given a probability distribution q over Z, summing q(z) over all possible z in Z must give 1. Delete 2. Reply 11. SunFish7 May 7, 2017 at 2:53 AMThanks for this, it is a key resource for our reading group discussion on VAE today https://github.com/p-i-/machinelearning-IRC-freenode/blob/master/ReadingGroup/README.md Reply Delete 12. mathnathan May 12, 2017 at 1:01 PMI believe the last formula for reverse KL should be an expectation over q, not over p. Great post. Thanks for your effort. Reply Delete Add comment Load more... Newer Post Older Post Home Subscribe to: Post Comments (Atom)BLOG ARCHIVE * ► 2017 (1) * ► January (1) * ▼ 2016 (11) * ► November (1) * ► September (2) * ▼ August (1) * A Beginner's Guide to Variational Methods: Mean-Fi... * ► July (3) * ► June (4) Not for reproduction. Simple theme. Powered by Blogger .",Variational Bayeisan (VB) Methods are a family of techniques that are very popular in statistical Machine Learning. VB methods allow us to r...,A Beginner's Guide to Variational Methods,Live,205 570,,Watch how to convert XML data to CSV format to load into dashDB. 
This video shows a tool called Convert XML to CSV found here: http://www.convertcsv.com/xml-to-csv.htm,Load XML data into dashDB,Live,206 574,"COMPOSE TIPS: DATES AND DATING IN MONGODB Published May 2, 2017 Working with dates in MongoDB can be surprisingly nuanced, and knowing how dates are stored can make avoiding pitfalls much easier. Read on as we examine the inner workings of MongoDB dates and show how to choose the right date type for your needs. At Compose Tips, we like to address the issues that can leave even experienced developers scratching their heads. We'll kick off this series by taking a look at how MongoDB stores dates. WHAT'S IN A DATE? Dates in MongoDB have a few different representations, and getting the right one can mean the difference between being able to effectively search your data by date range using aggregations and being forced to manage your dates on the client side of your application. Let's take a look at the different ways that MongoDB can store dates. Internally, MongoDB can store dates as either Strings or as 64-bit integers . If you intend to do any operations using the MongoDB query or aggregate functions, or if you want to index your data by date, you'll likely want to store your dates as integers. If you're using the built-in ""Date"" data type, or a date wrapped in the ISODate() function, you're also storing your date as an integer. If you're just looking for a simple way to display dates to a user and aren't concerned with performing operations on those dates, then storing the date as a String will allow you to use the output from a MongoDB query directly, without the need to convert a Date value into a String for display. This can be handy for platforms without a convenient or easy Date wrapper, or if you don't want to spend time processing the date on the client side of your application. Let's walk through each of the potential ways a Date can be represented in MongoDB and discuss the pros and cons of each. MILLISECONDS SINCE THE EPOCH One standard way that many databases store dates is as a count of milliseconds since the Epoch, with 0 representing January 1, 1970 at 00:00:00 GMT. This is how dates are stored internally in most programming languages. Storing dates as milliseconds since the Epoch makes comparing dates to each other a simple numeric comparison. Developers can also easily modify dates, including adding time frames (such as adding 1 day to a date) by computing the number of milliseconds in a day. While milliseconds are easy to manipulate programmatically, they're difficult for programmers to conceptualize. It's difficult to tell even what decade a date is in just by looking at the millisecond count, so it needs to be converted into a readable String before most developers will be able to display a date represented in this format. ISODATE() If you've tried to save a JavaScript Date object into MongoDB, you might've noticed that MongoDB automatically wrapped your date with a peculiar function: ISODate() . ISODate(""2012-12-19T06:01:17.171Z"") ISODate() is a helper function that's built into MongoDB and wraps the native JavaScript Date object. When you use the ISODate() constructor from the Mongo shell, it actually returns a JavaScript Date object. So why bother with ISODate() ? ISODate() provides a convenient way to represent a date in MongoDB as a String visually, while still allowing the full use of date queries and indexing.
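As a minimal sketch of what this looks like in the mongo shell (the events collection and its fields here are hypothetical):
// The shell displays the stored 64-bit date wrapped in ISODate().
db.events.insert({ name: 'signup', createdAt: new Date() });
// Because the value is a real date rather than a String, range queries, sorting, and indexing all work.
db.events.find({ createdAt: { $gte: new Date('2017-01-01') } }).sort({ createdAt: -1 });
db.events.createIndex({ createdAt: 1 });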
By wrapping the ISO date String in a function, the developer can inspect date objects quickly and visually without having to convert from a Unix timestamp to a time String. We can see this by comparing the following date using the ISODate() constructor: ISODate(""2012-12-19T06:01:17.171Z"") To the corresponding JavaScript Date constructor: Date(1355897837000) The ISODate() constructor is clearly easier to read at-a-glance for developers. One other major benefit is that, while there are many ways to represent dates, the ISODate() uses the standardized ISO format. Your clients don't have to do any guesswork to figure out what format they'll need to store dates in your system. Probably the biggest downside is that the ISODate() will convert your date to ISO format and, should you need a different date format on the client-side of your applications, you'll have to convert that date on the client side. This can be a concern when processing a lot of records that need to have the dates in a specific format, or when real-time processing of date stamps is a concern. STRING FORMAT The final format we'll take a look at is String format, which stores a date as a simple String in a human-readable format. Dates stored in String format are very easy to display and don't require any processing to use in visual displays. Also, assuming the date matches a standard format, the date stored in String format will be relatively easy to convert to a date on any platform. When a date is stored in String format, it can sometimes be difficult to determine what the actual format of the date is in the String. The following example illustrates this issue: does the following date represent the 1st of February 2017 or the 2nd of January 2017? 2017-02-01 This ambiguity can cause major issues if the date is parsed incorrectly. Developers using loosely-typed languages like JavaScript will sometimes accidentally store a time String rather than a date, so sometimes the presence of a String in the date field can indicate a logic error. WRAPPING UP MongoDB dates can initially cause some frustration for developers just starting out, and understanding the different ways that a date can be stored in MongoDB can help to ease that frustration. In a future article, we'll cover how to manipulate and compute with dates in MongoDB. -------------------------------------------------------------------------------- If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com . We're happy to hear from you. Image by Fabrizio Verrecchia John O'Connor is a code junky, educator, and amateur dad that loves letting the smoke out of gadgets, turning caffeine into code, and writing about it all. Love this article? Head over to John O'Connor ’s author page and keep reading.RELATED ARTICLES Apr 28, 2017NEWSBITS - MYSQL, ELASTICSEARCH, MONGODB, ETCD, COCKROACHDB, SQL SERVER, CRICKET AND JUICE NewBits for the week ending 28th April - MySQL 8.0.1's preview demos better replication, Elasticsearch, MongoDB and etcd get… Dj Walker-Morgan Apr 26, 2017FINDING DUPLICATE DOCUMENTS IN MONGODB Need to find duplicate documents in your MongoDB database? This article will show you how to find duplicate documents in your… Abdullah Alger Apr 25, 2017HORIZONTAL SCALING ARRIVES ON COMPOSE ENTERPRISE Today, Compose is bringing horizontal scaling to more databases on our Enterprise platform. 
MongoDB, Elasticsearch and Scylla… Jason McCay Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Customer Stories Compose Webinars Support System Status Support & Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ ScyllaDB MySQL Deployments AWS SoftLayer Google Cloud Services Enterprise Add-ons © 2017 Compose, an IBM Company","Working with dates in MongoDB can be surprisingly nuanced. Here, we examine the inner workings of MongoDB dates and show how to choose the right date type for your needs.",Compose Tips: Dates and Dating in MongoDB,Live,207 575,"UNDERSTANDING MANGO VIEW-BASED INDEXES VS. SEARCH-BASED INDEXESBy Tony SunNovember 18, 2015MANGO: DECLARATIVE QUERYING FOR APACHE COUCHDB™Mango allows users to declaratively define and query Apache CouchDB indexes.(It's also the open source library that powers Cloudant Query.) For anintroduction to its features, please refer to this post: https://cloudant.com/blog/introducing-cloudant-query/Recently, Cloudant open-sourced its Apache Lucene™-based full-text searchcapabilities for CouchDB as well: https://cloudant.com/blog/open-sourcing-cloudant-search/Mango leverages Lucene not only to perform text search, but also to enablead-hoc querying capabilities: https://cloudant.com/blog/cloudant-query-grows-up-to-handle-ad-hoc-queries/Users can now use either the original CouchDB view-based indexes or the newsearch-based indexes. In this post, I'll compare the two index types to giveusers an idea of when to use each (""json"" or view-based vs. ""text"" orsearch-based).WHEN JSON SYNTAX GETS TRICKYView-based indexes are most efficient for large datasets, but containlimitations on how a user could query an index. For example, given an indexdefined as:{ ""index"": { ""fields"": [""foo"", ""bar""] }, ""name"" : ""foo-index"", ""type"" : ""json""}The following query would fail:{ ""selector"": {""$or"": [{""foo"": ""val1""}, {""bar"": ""val2""}]}}{""error"":""no_usable_index"",""reason"":""There is no index available for this selector.""}To understand the limitation above, users must realize that the underlying indexis still a CouchDB view-based index. The values of the fields are used to compose the keys in theindex. When performing a query, a selector is transformed into a start_key and end_key range search against the index.To satisfy the $or query above, Mango would have to scan the index twice, once for ""foo"" , then another for ""bar"" and perform merging logic. This can get extremely complicated as queries becomemore complex.In order to bypass this limitation, users need to add a ""sub-query"" that willallow the query engine to scan the index once and return results. The rest ofthe query will then be used as an in-memory filter.To continue with the example above, the query must become:{ ""selector"": {""_id"": {""$gt"" : null}, ""$or"": [{""company"": ""x""}, {""twitter"": ""ba""}]}}The above query essentially does a full index scan to return all the documentsand then applies the rest of the $or query as a filter on those documents.CLEANER SYNTAX WITH SEARCH INDEX TYPESMango search-based indexes resolve this issue by using Lucene indexes. To seehow to create those indexes, refer to the cloudant-query-grows-up blog linked to above. Users no longer have to add these ""sub-queries"" toperform operations such as $or , $in , or $elemMatch .A user might be tempted to always use search-based indexes due to their ad-hocquery capabilities. 
Mango's view-based indexes, however, will perform better inscenarios where query patterns are well known and the user is already familiarwith their data model. Imagine these view-based indexes as a more pleasantabstraction of the traditional CouchDB Map-Reduce view indexing system.If users don't know in advance what queries will be executed, then Mangosearch-based indexes are the way to go. This flexibility, however, comes withits own tradeoffs.Underneath the covers, Mango search indexes create a single default field thatcatalogs every field in the document. Moreover, individual elements in an arrayare also indexed and enumerated in this field. This comprehensive approachallows the user to perform a full-text search via the $text operator. The behavior is turned on automatically when the user creates asearch-based index. So for large databases, index build times can be long. Thesystem provides an option to disable search-based index builds, but disabling italso turns off the full-text search feature.Users who want full ad-hoc capabilities can index the entire database withsearch-based indexes. It's important to note that this approach is differentthan the default field mentioned above. The default field is a single field thathas all the values in the document stored in that one field. When a user indexesthe entire database, all the fields in the document will have their ownrespective values stored in the index. Again, indexing the entire database willcreate long index build times.Given a document such as:{ ""first_name"" : ""john"", ""last_name"" : ""doe""}... with BOTH text search enabled AND the entire database indexed will have anindex that has:""default_field"" - ""john"", ""doe""""first_name"" - ""john""""last_name"" - ""doe""Users who don't want to index their entire database can specify fieldsindividually. Suppose a user only wants to index ""first_name"" . Then the index would look like:""default_field"" - ""john"", ""doe""""first_name"" - ""john""A query that searches for ""last_name"" would then throw an ""index not found"" error.Finally, a user can turn off ""default_field"" and only index specific fields:""first_name"" - ""john""But this would limit the ad-hoc capabilities of search-based indexes, and theuser should use a view-based index instead.ARRAYSArrays can also be confusing for first-time users of Mango. Subtle differencesalso exist for arrays when using view-based indexes vs. search-based indexes.ARRAYS: JSONCurrently, view-based indexes cannot index individual array elements with onefield definition. Given an array such as:""array_field"": [10, 20, 30]If the view-based index is defined as:{ ""index"": { ""fields"": [""array_field""] }, ""name"" : ""array-index"", ""type"" : ""json""}Users can query against the index to match the array exactly:{ ""selector"": {""array_field"" : [10, 20, 30]}}However, the user cannot access an individual array element. Note that mangouses dot-notation to access the individual elements, i.e., my_array.0 , my_array.1 , etc.In the example above, if a user tried:{ ""selector"": {""array_field.0"" : 10}}... 
then the user would get: {""error"":""no_usable_index"",""reason"":""There is no index available for this selector.""} Users would have to specifically index each element in the array to access the individual elements. For example: { ""index"": { ""fields"": [""array_field.0"", ""array_field.1"", ""array_field.2""] }, ""name"" : ""array-index"", ""type"" : ""json""} However, if a user did not specify individual elements — and indexed the array as a whole — he or she can still perform operations such as $in on the array. For example: {""selector"": {""_id"": {""$gt"": null},""array_field"": {""$in"": [10]}}} The reason this works is, again, because Mango performs the above $in operation as a filtering mechanism against all the documents. As we saw in the conclusion of the previous section on JSON syntax, the performance tradeoff with the query above is that it, essentially, performs a full index scan and then applies a filter. ARRAYS: TEXT With Mango search-based indexes, the user can query the index however he or she likes with one index definition: { ""index"": { ""fields"": [{""name"": ""array_field.[]"", ""type"": ""number""}] }, ""name"" : ""array-index"", ""type"" : ""text""} This not only indexes the entire array, but also individual elements in the array. Users can then ad-hoc query the array. WHAT'S NEXT? Hopefully this post helps clarify Mango view-based indexes vs. search-based indexes. In order to enable search-based indexes, currently, users must first enable text search in their CouchDB distribution. For instructions on recompiling the current release of CouchDB to use the new search features, read this article by fellow Apache CouchDB project committer Robert Kowalski: https://cloudant.com/blog/enable-full-text-search-in-apache-couchdb/ If recompiling seems like too much work, don't worry! Lucene text search, along with Mango's declarative query system, will be included in the upcoming 2.0 release of Apache CouchDB. For updates, follow the project on Twitter at: https://twitter.com/couchdb ... or join one of the many excellent mailing lists: http://couchdb.apache.org/#mailing-lists © ""Apache"", ""CouchDB"", ""Lucene"", ""Apache CouchDB"", ""Apache Lucene"", and the CouchDB and Lucene logos are trademarks or registered trademarks of The Apache Software Foundation. All other brands and trademarks are the property of their respective owners.","Users can now use either the original CouchDB view-based indexes or the new search-based indexes to query Cloudant and CouchDB. In this post, I'll compare the two index types to give users an idea of when to use each (""json"" or view-based vs. ""text"" or search-based).",Understanding Mango View-Based Indexes vs.
Search-Based Indexes in Cloudant and CouchDB,Live,208 580,"This video goes through a demo/workshop on how to build a Java EE app that uses Cloudant and Watson to suggest employee recommendations. The app also uses jQuery, Angular, and Bootstrap on the frontend. This app was internally developed by IBM employees at a 48-hour hackathon. The source code is available for the app at http://ibm.biz/talent-manager. The complete source code is available for the app at http://ibm.biz/talent-manager-complete (don't cheat!). For feedback please contact @jsloyer on Twitter (http://twitter.com/jsloyer). Use Case Behind the App: Meet Ivy. She's a talent manager at a growing tech startup. She's having trouble finding the right candidate based on: * technical skills * personal compatibility I wish I could clone my developer, Emory Wren -- having two guys like Emory working here would be amazing. But that's not possible. So what's the next best thing? Talent Hotspot: a web application that allows you to search for candidates from a pool of applicants based on how closely they resemble one of your current employees. Talent Hotspot uses Watson's User Modeling API service to analyze a potential candidate's personality based on their answers to a questionnaire (completed upon application to Ivy's company). The application can issue queries such as ""Find me a Developer like Craig Smith"", then search through all possible candidates and return a ranked list of candidates sorted by highest-to-lowest percentage of personality resemblance. From here, searches can be refined by including technical skills: ""Find me a Developer like Craig Smith, who knows Java, C and Python""","This video goes through a demo/workshop on how to build a Java EE app that uses Cloudant and Watson to suggest employee recommendations. The app also uses jQuery, Angular, and Bootstrap on the frontend. This app was internally developed by IBM employees at a 48-hour hackathon.",Building a Java EE webapp on IBM Bluemix Using Watson and Cloudant,Live,209 582,"THIS WEEK IN DATA SCIENCE (FEBRUARY 14, 2017) Posted on February 14, 2017 by Janice Darling Here’s this week’s news in Data Science and Big Data. Don’t forget to subscribe if you find this useful! INTERESTING DATA SCIENCE ARTICLES AND NEWS * IBM’s Watson and Bluemix step up: Free cloud-based push to boost tech skills – IBM aims to develop the tech skills of millions of young Africans for free using Watson AI and Bluemix cloud. * Job trends for R and Python – A look at the recent job trends for R, Python and SAS. * IBM launches cognitive computing hardware unit: Enter the Watson, Power 9 stack – IBM uses research and application to speed up training for Watson, neural networks and machine learning. * – A current list of some useful APIs for Machine Learning and Prediction.
* New IoT Cybersecurity Alliance formed by AT&T, IBM, others – IBM partners to form alliance to address concerns around IoT and solve its security challenges. * 6 ways Business Intelligence is going to change in 2017 – How businesses will utilize data and its advantages. * ​IoT devices will outnumber the world’s population this year for the first time – A prediction on the growth in number of IoT devices for the next three years. * What is a Data Scientist? – A broad definition for the term Data Scientist. * IBM’s big data meetup program approaches a significant milestone – IBM’s community of big data developers approaches 100,000 members. * 5 Career Paths in Big Data and Data Science, Explained – Resources to sharpen skills required for 5 different paths in Data Science and Analytics. * 6 Top Big Data and Data Science Trends 2017 – Predictions about Big Data and Data Science as the world becomes more dependent on Data. * Top R Packages for Machine Learning – A ranking of the top Machine Learning Packages for R. * Understanding data ownership in the data lake – Answers to questions dealing with the ownership of data and the importance of these questions. * Infographic: The 4 Types of Data Science Problems Companies Face. – The difficulties surrounding solutions to Data Science Problems. * Making the Most of Big Data Requires Effective Training in Data Science – A discussion of the type of training required to create effective data scientists. * City enlists IBM’s Watson to fix outdated 311 system – How NYC will use IBM’s Watson to handle 311 calls. FEATURED COURSES FROM BDU * Big Data 101 – What Is Big Data? Take Our Free Big Data Course to Find Out. * Predictive Modeling Fundamentals I – Take this free course and learn the different mathematical algorithms used to detect patterns hidden in data. * Using R with Databases – Learn how to unleash the power of R when working with relational databases in our newest free course. UPCOMING DATA SCIENCE EVENTS * IBM Event: Big Data and Analytics Summit – February 14, 2017 @ 7:15 am – 4:45 pm, Toronto Marriott Downtown Eaton Centre Hotel 525 Bay St. Toronto Ontario. SHARE THIS: * Facebook * Twitter * LinkedIn * Google * Pocket * Reddit * Email * Print * RELATED Tags: analytics , Big Data , data science -------------------------------------------------------------------------------- COMMENTS LEAVE A REPLY CANCEL REPLY * About * Contact * Blog * Events * Ambassador Program * Resources * FAQ * Legal Follow us * * * * * * * Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.",Here’s this week’s news in Data Science and Big Data.,"This Week in Data Science (February 14, 2017)",Live,210 588,"THINKY AND RETHINKDB Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published Jul 28, 2016If you're looking for alternatives to writing queries and modeling your data using ReQL (the RethinkDB query language), you might want to consider looking at Thinky , an open-source ORM (Object Relational Mapper) designed for RethinkDB. Thinky provides a number of features for defining schema, querying, and adding relationships to your models. The ORM uses the same syntax as RethinkDB's Node.js driver , which makes it a great alternative to developing applications using ReQL. In this article, we will primarily concentrate on creating schemas and defining relations between your models. 
THE SCHEMA If you’re already familiar with building schemas in Mongoose , an ODM (Object Document Mapper) for MongoDB, then Thinky will look familiar since the creation of field names, validations, and queries are similar. If you are not familiar with Mongoose, then read our articles introducing you to it here and our article covering the latest version: Mongoose 4 . Overall, Thinky and Mongoose work on similar principles, which is to provide you with an efficient and object-oriented way to model your data. The difference between the two is that an ODM like Mongoose concerns itself with the structure of data within documents or tables, while an ORM like Thinky goes further by modeling the relationships between them. Thinky is a lightweight Node.js ORM that uses an alternative version of RethinkDB’s Node.js driver, rethinkdbdash , on the backend that has the added bonus of connection pools. While the ORM is not as fully featured as Mongoose, it enforces schema validations and creates indexes and tables automatically out of the box. One of the most useful features, however, is its four predefined relation methods that help you create relations between models, which we will discuss in more detail later. To give you a small example of the similarities between Mongoose and Thinky, let’s look at a schema in Thinky and Mongoose in the context of creating characters and houses from the popular series of novels and TV series “Game of Thrones"". THINKY SCHEMA var thinky = require(""thinky""); var type = thinky.type; var r = thinky.r; var Character = thinky.createModel(""Character"", { id: type.string(), // or String name: type.string(), createdAt: type.date().default(r.now()) // using RethinkDB’s r command through rethinkdbdash // to set the date on the server during creation }); var House = thinky.createModel(""House"", { id: type.string(), houseName: type.string(), characterId: type.string() }); MONGOOSE SCHEMA var mongoose = require(""mongoose""); var Schema = mongoose.Schema(); var CharacterSchema = Schema({ _id: String, // or {type: String} name: String, createdAt: {type: Date, default: Date.now} // uses JavaScript’s Date.now() method }); var HouseSchema = Schema({ _id: String, houseName: String, characterId: String }); var Character = mongoose.model(""Character"", CharacterSchema); var House = mongoose.model(""House"", HouseSchema); Looking at the example above, the advantage of using Thinky is that it defines our schemas, creates a model and assigns it a name, while creating the necessary tables within RethinkDB simultaneously. Whereas with Mongoose, you must define the schema then define the model. Thinky's documentation provides some good schema examples with the different field type options (String, Number, Date, Array, Boolean, etc.) and chainable methods that you can use to define fields further. RELATIONSHIPS AND JOINING DOCUMENTS An appealing feature of Thinky is its ability to help you assign relations between your models. The documentation for creating relations is not entirely clear; therefore, you might have to take a deep dive into its github issues for clarification. This section intends to explain how to define relationships between models and highlights some of the peculiarities and pitfalls you may encounter when using them. We will also provide you with a brief look at how they are interpreted in RethinkDB. RethinkDB's Node.js driver natively allows us to define many-to-many and one-to-many relationships by using the eqJoins and zip ReQL commands. 
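Returning briefly to the schema definitions above, those chainable type methods are worth a quick illustration. The following is a hedged sketch only: the validator names used here (required, min, max, enum, default) are the ones commonly shown in Thinky's documentation, and the Character model shape is reused from the examples above.
var Character = thinky.createModel('Character', {
  id: type.string(),
  name: type.string().min(1).max(100).required(),              // must be present, length-bounded
  status: type.string().enum(['alive', 'dead', 'unknown']).default('alive'),
  createdAt: type.date().default(r.now())                      // set server-side at insert time
});
If a document fails one of these checks, Thinky rejects the save rather than writing invalid data, which is the same role Mongoose validators play on the MongoDB side.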
For a brief overview of joining tables in RethinkDB using the eqJoin command, we've provided an overview with examples here . In general, the eqJoins and zip commands will take two tables, join them on foreign and primary keys ( eqJoin ), and merge them together returning you the joined documents. The syntax for this query would be as follows: r.table(""House"").eqJoin(""characterId"", r.table(""Character"")).zip() However, leveraging the power of Thinky, we are provided with four predefined relation methods ( hasOne , hasMany , belongsTo , and hasManyAndBelongsTo ) that write all of the ReQL commands for us using RethinkDB’s joins capabilities behind the scenes. All we need to do is provide Thinky with the primary and foreign keys so that it knows where the joining should occur. To define the relationships between our character and house models, all we have to write is the following: Character.hasOne(House, ""house"", ""id"", ""characterId""); House.belongsTo(Character, ""character"", ""characterId"", ""id""); Using the relation methods, we are able to choose the tables (or models) where the relations shall occur ( Character and House ). Then, we create a custom field name for that relationship ( house and character ) and provide the primary and foreign keys from the tables where each should be joined. If we need to add secondary indices, Thinky also makes this painless, since it does all the heavy lifting for you. The method ensureIndex , which under the hood wraps RethinkDB's indexCreate and indexWait commands together, checks to see if the index you defined exists and if it doesn't, creates it for you. Adding an index in our data just needs the following: Character.ensureIndex(""name""); House.ensureIndex(""houseName""); Now that we have prepared our models, relations, and indices, we are ready to start inserting data into our database. The only information that we must include is the name of the character and the name of the house they belong to. { ""createdAt"": ""2016-07-26T05:20:31.381Z"", ""id"": ""dc47cc90-f629-499b-8c45-efb1e987717c"", ""name"": ""Robert Baratheon"" } { ""houseName"": ""Baratheon"", ""id"": ""774b00a8-193b-468f-9784-919d56337baf"", ""characterId"": ""dc47cc90-f629-499b-8c45-efb1e987717c"" } Since our hasOne relationship points to characterId as the foreign key field in our House table, Thinky will automatically populate that field with the id of the appropriate character from the Character table when both tables are saved using the saveAll() method. In order to save these documents without running the risk of not inserting foreign keys, we must use the saveAll() method on our character and then pass in an object with the name of hasOne relationship we defined previously. character.house = house; character.saveAll({house: true}).then(function(data) {...}); Therefore, character.house joins the Character and House tables together. We assign it to house which will be the key where our house document will be stored when it is returned. Within the saveAll() method, we insert the name the document we want to join that was defined in our hasOne relation ( house ). Then, we set it to true in order to tell Thinky to save the house and the character tables together. After we've inserted our documents, we can use the getJoin query method to return the joined documents. 
Character.getJoin({house: true}).run().then(function(result) { console.log(result); }); In our query we call the getJoin method, which is also given an object with the name of the field we created in the hasOne relationship. It is also set to true so that Thinky knows which table to join to provide you the correct data. When we run this query, we get the following result: { ""createdAt"": ""2016-07-26T05:20:31.381Z"", ""house"": { ""houseName"": ""Baratheon"", ""id"": ""774b00a8-193b-468f-9784-919d56337baf"", ""characterId"": ""dc47cc90-f629-499b-8c45-efb1e987717c"" }, ""id"": ""dc47cc90-f629-499b-8c45-efb1e987717c"", ""name"": ""Robert Baratheon"" } If we ran the same query using RethinkDB's eqJoin command, it will produce a similar result. The only difference is that in the result above, the key house includes nested data from our joined table, whereas with eqJoin the data wouldn't be nested. RethinkDB's eqJoin command might provide a better solution than Thinky's getJoin method if you combine eqJoin with the zip and without commands. These commands will merge your documents together rather than storing them as nested data. (Refer to our article on RethinkDB joins for a more in depth discussion.) JOINING MULTIPLE DOCUMENTS Sometimes you have tables that have documents containing the same foreign key id. Using a hasMany relation is the optimal solution, which has the same syntax as the hasOne relation. For our use case, we might consider that some characters belong to two houses (i.e. Jon Snow) and what hasMany will do is modify our house object into an array of objects. Exclaimer: If you have two documents with the same foreign key and a hasOne relationship, Thinky will throw an error stating that you have more than one document with the same foreign key, so make sure that you have a hasMany relation defined beforehand. So, in the House table we might have two houses with the same characterId that refer to Jon Snow as in the following: { ""houseName"": ""Targaryen"", ""id"": ""c129274e-44f6-4224-9c30-e7112f596121"", ""characterId"": ""f39e479e-5961-4dc8-b763-5c0397279c6a"" }, { ""houseName"": ""Stark"", ""id"": ""4c459d81-3162-4bc7-acca-4bb5467ccd13"", ""characterId"": ""f39e479e-5961-4dc8-b763-5c0397279c6a"" } Querying our database for Jon Snow, using the same query as we executed above, will produce the following: { ""createdAt"": ""2016-07-26T05:20:31.381Z"", ""house"": [ { ""houseName"": ""Stark"", ""id"": ""4c459d81-3162-4bc7-acca-4bb5467ccd13"", ""characterId"": ""f39e479e-5961-4dc8-b763-5c0397279c6a"" }, { ""houseName"": ""Targaryen"", ""id"": ""c129274e-44f6-4224-9c30-e7112f596121"", ""characterId"": ""f39e479e-5961-4dc8-b763-5c0397279c6a"" } ], ""id"": ""f39e479e-5961-4dc8-b763-5c0397279c6a"", ""name"": ""Jon Snow"" } Thus, we are given an array of houses with Jon Snow's id as the characterId in the house array, which produces a nice result that will allow us to manipulate the data further. We did not have to change any of our queries, or our code, to implement the hasMany relation. This is nice when you want to write an application fast and without too many obscurities. GET THINKY So, we've looked at some of the ways we can model, query, create relations between tables, and store data using Thinky. While it does not have all the capabilities that Mongoose has for MongoDB, the author is actively adding new features in order to increase its functionality and usability. 
Overall, it provides you with a few shortcuts to create tables and relations between them, which reduces the amount of code you'd have to write if you decided to only use ReQL. Also, it makes your code readable and helps produce consistent results. Image by Kalen Emsley Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Abdullah Alger is a content creator at Compose. Moved from academia to the forefront of cloud technology. Love this article? Head over to Abdullah Alger’s author page and keep reading. Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Database as a Service Customer Stories Support System Status Support Enhanced Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ Enterprise Add-ons Deployments AWS DigitalOcean SoftLayer© 2016 Compose","Thinky is an open-source ORM designed for RethinkDB. Here, we show how to create schemas and define relations between your models.",Thinky and RethinkDB,Live,211 594,"THE POTENCY OF IDEMPOTENT WITH RABBITMQ AND MONGODB UPSERT Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published Oct 24, 2016Designing with possible failure in mind is a good strategy when building with distributed systems. The cloud is a distributed system. Latency, network failures, service end points changing and such are to be expected. If you don't expect them, well, be prepared for the unhappy Twitter responses or pager alert in the middle of the night when your application goes down. One of the easiest strategies to help mitigate potential failures is to design your data model for idempotent functions, which can be correctly called over and over again, and to rely on a messaging system to deliver data at least once. It's easy to do this with RabbitMQ and a ready made statement for idempotent data like Mongo's upsert . You Use Idempotent Data Models All the TimeWhen you make a deposit at your bank or transact with some large e-commerce companies, the data contained in those events is modeled for idempotent functions. This type of data model tends to be nothing more than a virtual ledger versus a non-idempotent model that relies on update in place semantics. In the ledger model a history is kept and a tally can be computed from the individual entries. In the non-ledger model there is no history just a balance field. Bank accounts, or even accounting in general, are great examples of this: In the ledger model the same piece of data can be processed more than once without that being an error. In the above example, if the Exxon withdrawal was inserted over and over again in the ledger model there wouldn't be any difference in the state assuming that the entry had some kind of key that identified it as being the same. Update the entry with the same data and rerun the tally to get the same balance as before. No harm other than some wasted compute cycles. In the non-ledger model though, every update other than the first would be an error. It would rerun the withdraw transaction steps and over debit the account. At Least OnceRabbitMQ is one of many messaging systems that provide at least once semantics if configured to do so. By requiring the consumer of a message from a queue to acknowledge when it has finished processing the message, one can guarantee at least once processing. If there is a failure between the consumer's finishing processing prior to acknowledging then it will be rerun again which is the more than once scenario. 
Systems that are not designed to handle this can create errors. The above non-ledger style account above would be a case in point. Using pika, a Python RabbitMQ driver, we can review configuring the queue to ack when done. The below connects to Rabbit, consumes from the transaction queue, and the acknowledges a message: import json import pika conn = pika.BlockingConnection(pika.ConnectionParameters( host='aws-us-east-1-portal12.dblayer.com', port=15518, virtual_host='tangible-rabbitmq-66', ssl=True, credentials=pika.credentials.PlainCredentials(""hays"", ""thumper"", True))) channel = conn.channel() for frame, props, body in channel.consume('transaction'): msg = json.loads(body) # Do Mongo Upsert Here! (See Below) channel.basic_ack(frame.delivery_tag) In the above you can see the basic_ack . We ack when message processing is complete. The thing that is also handled here is that if this doesn't finish processing then the original message will still be in the transaction queue without having been ack ed. That means in certain scenarios this entire code would be run again. In the non-ledger system this could be a problem with double processing. Idempotent solves this. Mongo's upsert as an Idempotent FunctionSo, one can insert data and let the datastore generate a key such as Mongo's ObjectId or using a SERIAL / SEQUENCE from PostgreSQL. This can be a big problem in the more than once scenarios like above where you could end up with the following: [ { id: 1, action: ""WITHDRAW"", account: 123, value: 45.0, at: ""2016-10-11T17:04:00-06:00"" }, { id: 2, action: ""WITHDRAW"", account: 123, value: 45.0, at: ""2016-10-11T17:04:00-06:00"" } ] Obviously, this would be bad for our account since it would be withdrawing an extra $45. The easiest solution is to treat the WITHDRAW as an entity before it ever makes it to the datastore or to any code where a failure could create multiple copies of the data. Keying the data with a UUID before publishing to a queue is a simple solution: { id: ""37320056-9d42-492a-a216-03bc5beea0ce"", action: ""WITHDRAW"", account: 123, value: 45.0, at: ""2016-10-11T17:04:00-06:00"" } Since the id is part of the original record we can assert it as the unique key and just upsert it to our store. We could do it more than once too as long as the processing doesn't create any other side effects. Extending the Python example from above, we use the pymongo MongoDB driver to key and publish just such data: ... import pymongo import ssl mongo = pymongo.MongoClient(""mongodb://idem:potent@aws-us-east-1-portal.19.dblayer.com:15513/idempotent"", ssl=True, ssl_ca_certs='i_am_a_ca.pem') accounts = mongo.idempotent.accounts ... for frame, props, body in channel.consume('transaction'): msg = json.loads(body) account = accounts.replace_one({'_id': msg['_id']}, msg, True) channel.basic_ack(frame.delivery_tag) ... In the above we've connected to a Mongo database named idempotent with a collection named accounts via SSL with our self-signed certificate which was copied from the Compose deployments Overview page. The new addition here is the accounts.replace_one function call. With the True parameter, it will insert the msg if not found. Otherwise it updates it. These together are upsert . It's all that is needed to safely process retries which might rarely happen from our consuming queue code. This is idempotent. An Added BonusEmbracing this publisher, queue, and consumer processing allows for splitting an application at a good boundary. 
Front end responses can proceed quickly and back end processing can be pushed to an asynchronous worker which relies on a queue: Less latency and more speed up front with acceptable speed and buffered processing on the back equals less total resources which is good from a dollars perspective. To parse that, it makes sense to handle front end HTTP requests optimistically then to asynchronously handle the heavy lifting on a back end process that can even be buffered during peak usage. Insert Queue, Respond ImmediateThere are many scenarios where having an asynchronous worker handle the heavy, stateful lifting makes good sense. Trading a transactional style commit for a simple queue insert can be a really good tradeoff for some workloads since your web serving code can respond without waiting. Let's see what it takes using Python's Flask web framework to create a simple API and use the pika driver to publish from the front end to RabbitMQ: from flask import Flask from flask import request from flask import jsonify import pika import json import uuid app = Flask(__name__) conn = pika.BlockingConnection(pika.ConnectionParameters( host='aws-us-east-1-portal12.dblayer.com', port=15518, virtual_host='tangible-rabbitmq-66', ssl=True, credentials=pika.credentials.PlainCredentials(""hays"", ""thumper"", True))) channel = conn.channel() channel.queue_declare(queue='transaction') @app.route(""/transaction"", methods=['POST']) def transaction(): msg = request.json msg[""_id""] = str(uuid.uuid4()) channel.basic_publish(exchange='', routing_key='transaction', body= json.dumps(msg), properties=pika.BasicProperties( delivery_mode = 2 )) return jsonify(msg) if __name__ == ""__main__"": app.run() The above connects to Rabbit, declares a queue, and creates an HTTP endpoint that accepts a JSON POST request. For each POST , it parses the request and generates a unique key before it publishes the data to a persistent, on disk queue. By keying the entity here instead of at the database layer, it protects the identity of the data and allows the later upsert to be idempotent. TradeoffsThere is a lot to be gained by building your apps with messaging: separating concerns, asynchronous and possibly parallel processing on the back end, the ability to use multiple languages, and on the list can go. There are some tradeoffs in an approach such as added complexity and running a new server to handle messaging. The beauty of Compose is that we can handle running that RabbitMQ server for you so you can go about the business of building scalable applications. Image by: Chris Pastrick Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Hays Hutton writes code and then writes about it. Love this article? Head over to Hays Hutton’s author page and keep reading. Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Customer Stories Compose Webinars Support System Status Support & Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ ScyllaDB MySQL Enterprise Add-ons * Deployments AWS SoftLayer Google Cloud * Subscribe Join 75,000 other devs on our weekly database newsletter. 
© 2016 Compose","One of the easiest strategies to help mitigate potential failures is to design your data model for idempotent functions, which can be correctly called over and over again, and to rely on a messaging system to deliver data at least once.",The Potency of Idempotent with RabbitMQ and MongoDB Upsert,Live,212 601,"Homepage Follow Sign in / Sign up Homepage * Home * Data Science Experience * * Watson Data Platform * Armand Ruiz Blocked Unblock Follow Following Lead Product Manager Data Science Experience #IBM #BigData #Analytics #RStats #Cloud - Born in Barcelona Living in Chicago - All tweets and opinions are my own Jun 23, 2016 -------------------------------------------------------------------------------- MODELING ENERGY USAGE IN NEW YORK CITY On June 6 we introduced the **IBM Data Science Experience** to the world at the Spark Maker Event that took place in Galvanize. We demonstrated the Experience with a real use case developed in partnership with BlocPower. BlocPower is a startup based in New York City. Its technology and finance platform develops clean energy projects in American inner cities. IBM Data Science Experience helped BlocPower perform a comprehensive energy audit of each property to determine the correct mix of high-efficiency technology to reduce each customer’s energy consumption. Tooraj Arvajeh, Chief Engineering Officer at BlocPower, explained how IBM Data Science Experience made this process simpler. “BlocPower operation is diverse from outreach and targeting, origination of investment-grade clean energy projects to financing projects through our crowdfunding marketplace. Data is the underlying tool of our operation and IBM’s Data Science Experience will facilitate a closer integration across it and help our business scale up faster. “GOALS OF THE DEMO: - Easily import data into a notebook from object storage to quickly start analyzing data and creating predictive models. - Model energy usage of buildings in kWh. - Identify buildings that consume energy inefficiently. - Create a project and collaborate with other data scientists. - Create an easy-to-use application to make the outcome of the models consumable by any user. To do that, we used tools that data scientists love today that are integrated into the IBM Data Science Experience: Jupyter notebooks connected to Apache Spark, RStudio, Shiny, and GitHub. These are the steps that we followed: 1- GitHub + Jupyter notebooks = ❤ When starting a new project, the data scientist can choose to start from scratch or to leverage someone else’s work. In this case, we showcase the Import from URL capability to import an existing notebook from GitHub and start working on it right away. There are more than 200k public Jupyter notebooks out there that you can use! 2- Load and clean data To analyze data in a Jupyter notebook, first load the data. Many libraries and commands can do that, but it’s not always obvious which one to use. One of the add-ons to Jupyter notebooks is the capability to access data files stored in object storage or available through data connections and in one click to add the code needed to load the data into the notebook. Once the data is loaded, the next step is to clean it. We created a library called Sparkling.Data , which can scale to big data, to help the data scientist perform this task. 
3- Data Exploration After cleaning the data, we used Matplotlib , the best tool available for data visualization in Python, to explore the correlations between energy usage and building characteristics such as age, number of stories, square footage, amount of plugged equipment, and domestic and heating gas consumption. By analyzing variable relationships, the data scientist can, for example, determine the best model to use and which variables have more predictive power. 4- Create a Prediction Model Our goal is to create a model that predicts the energy consumption in kWh of different buildings based on characteristics such as square feet, age, number of stories, and so on. We model energy usage with a linear regression using the algorithm included in scikit-learn , one of the best Python libraries for machine learning. Before running the linear regression, we used the MaxAbsScaler function from scikit-learn to scale the data. To visualize the fit of this model, we use a scatter plot of the observed vs. the predicted values. The resulting R-squared value was approximately 0.72. 5- Classify buildings by efficiency We used the popular **K-means** algorithm to cluster buildings in NYC based on four dimensions that indicate energy efficiency: gas use for heating, gas use for domestic purposes, electricity use for plugged equipment, and electricity use for air conditioning. In the next matplotlib plot, we colored our buildings by using the K-means labels with K=4 and using two out of the four dimensions. This visualization, and other visualizations not shown here, helped us reduce the four clusters to two. These two clusters of buildings were interpreted as the efficient and the inefficient groups of buildings. 6- Flexdashboard and Shiny in RStudio RStudio just published on CRAN a new R package called Flexdashboard . This great package enables creating dashboards very easily, and you can include Shiny code to make dashboards very interactive. A dashboard can be shared with anyone by simply sending the URL. The dashboard is divided into 4 sections: - Data Exploration : A map of buildings colored by their electricity consumption. When a building is selected, a bar plot indicates how this building is doing with respect to the average energy efficiency measured in four dimensions. - Clustering : A map of buildings classified as efficient or inefficient. - Prediction : Scoring of the linear regression model built in the notebook to predict the energy usage in kWh and annual cost of electricity for the buildings. On the left side are sliders for selecting the properties of the building to score the model. - Raw Data : We use the Data.Tables package to display the data set with search and sorting capabilities. Link to the Shiny Application You can check out the 10-minute demo of IBM Data Science Experience here: We created a GitHub repository with all of the material and instructions needed to run this demo, too. Enjoy! Link to GitHub repo * Data Science * Machine Learning * IBM 4 2 Blocked Unblock Follow FollowingARMAND RUIZ Lead Product Manager Data Science Experience #IBM #BigData #Analytics #RStats #Cloud - Born in Barcelona Living in Chicago - All tweets and opinions are my own FollowIBM WATSON DATA PLATFORM Build smarter applications and quickly visualize, share, and gain insights * 4 * * * Never miss a story from IBM Watson Data Platform , when you sign up for Medium. 
Learn more Never miss a story from IBM Watson Data Platform Get updates Get updates",On June 6 we introduced the **IBM Data Science Experience** to the world at the Spark Maker Event that took place in Galvanize. We demonstrated the Experience with a real use case developed in…,Modeling energy usage in New York City,Live,213 602,"COMPOSE'S 2016 - ALL ABOUT THE DATABASE Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published Dec 22, 2016What a year it's been for Compose. We... * Brought MongoDB 3.2 , PostgreSQL 9.5 , ElasticSearch 2.4.0 and Redis 3.2 to the Compose system * Built features around them to make working with them even easier * Grew to keep up with demand for our production-ready databases and talking about those challenges and the the tools we use * Introduced Compose for MySQL and ScyllaDB in beta to the Compose platform * Promoted RabbitMQ out of beta * Introduced Compose databases as a service on IBM Bluemix * Added Google's Cloud Platform as a deployment option for Compose databases * Announced and shipped the Compose Enterprise option for private database clusters * Integrated FIDO universal 2nd-factor authentication into the Compose UI * Held our first DataLayer conference And if that's not enough, read on to find out what happened with your favorite databases at Compose. MONGODB 2016 started with MongoDB 3.2 on the beta of Compose's new MongoDB+SSL deployments. Those beta deployments became the default MongoDB configuration on the Compose platform after a month and came complete with support for WiredTiger and Oplog access . The addition of new import capabilities to MongoDB to help people migrate came later in the year. MongoDB 3.2 itself offered some exciting new features such as partial indexes, validation , improved aggregation and lookup which were covered in the Compose Articles blog. There was also coverage of other subjects like connection pooling , connecting with Go , using Meteor 1.4 with Compose , MongoDB data transfers with NiFi , geospatial queries and using Node-RED with MongoDB for prototyping apps . GraphQL is up and coming and we looked at using it with MongoDB early in the year, thanks to a Write Stuff author. POSTGRESQL The arrival of PostgreSQL 9.5.2 in April got us asking, and answering, a regular question - "" Could PostgreSQL 9.5 be your next JSON database? "". PostgreSQL 9.5 included some great new features: row level security and group by options (following on from 2015's look at upsert in 9.5 ). PostgreSQL deployments saw many enhancements: cross database queries extensions , performance and extension views for your Compose console, and a SQL query data browser to let you browse PostgreSQL's data from your browser. In Compose Articles, we explored PostgreSQL features like full-text search indexing and per-connection write consistency . There were also looks at some alternative ways of accessing your PostgreSQL data such as using PostgREST to create a RESTful API from your schema, hugsql with clojure for an SQL-centric approach and a look at PostGraphQL for quickly bringing GraphQL to your database. 
Compose's in-house analytics ""Metrics Maven"" started a new and on-going series, and as you might expect with us using PostgreSQL in-house there were a lot of useful articles about breaking down data using PostgreSQL: * Window Frames in PostgreSQL * Calculating a Moving Average in PostgreSQL * Making Data Pretty in PostgreSQL * Creating Pivot Tables in PostgreSQL using Crosstab * Beyond Average: A Look at Mean in PostgreSQL * Meet in the Middle Median in PostgreSQL ELASTICSEARCH The recent update to Elasticsearch 2.4.0 on Compose is designed to keep things fresh as we have worked on making Elasticsearch more accessible with a new data browser . Meanwhile, in Compose Articles, there's been a look at how to set up Kibana locally to explore your Elasticsearch deployments. Elasticsearch on Compose also got simplified and more effective security with our new Let's Encrypt based TLS Certificates . Compose's Elasticsearch coverage included a wide range of subjects in 2016. There was a four-part series on Elasticsearch and Perl on how to connect and monitor an ElasticSearch Cluster , how to do indexing , advanced index options , and how to use the querying and search features of the Elasticsearch.pm perl module. For the more modern developer, they now could learn how to leverage their Node.js skills with our Getting started with Elasticsearch and Node.js five part series which worked with real data ( 2 , 3 , 4 and 5 ). Compose's Metrics Maven also added to our library of Elasticsearch articles by looking at how scoring works , a mini-series on increasing Elasticsearch relevance , and a deep dive into how to use query string queries effectively. REDIS Redis on Compose saw many improvements over the year. We started exposing more controls for it like the KEA switch so you could tune it for your needs , and we let you secure it with SSH tunneling from the Compose Redis console. Our developers also added a data browser , made migrations easier with new Redis importing , let you see slow logs , tuned the autoscaling resolution and improved our cache/storage Redis modes , boosting the cache performance for many users. Updating to Redis 3.2 brought a number of new features to the database itself; one article looked specifically at the Redis Geo API , which offers applications new ways to understand the geographic proximity of things - ideal for mobile applications. Another looked at Lua scripting , an older feature which you needed to know about so we could talk about Redis 3.2's new Lua debugging . Compose Articles also looked at the most popular Redis drivers and using Redis PubSub and web sockets . RETHINKDB RethinkDB always impressed us with its solid engineering, and the release of RethinkDB 2.2.4 and 2.3.2 brought features like user authentication and native SSL to the database, and a new proxy to Compose RethinkDB deployments. Compose articles showed different ways to connect to RethinkDB, such as from Elixir applications and by using the new Java driver over SSL . Other articles demonstrated strategies for preventing data loss by configuring RethinkDB replicas, how to aggregate data from multiple tables using RethinkDB Joins , using new aggregation features like fold , and using the Thinky ORM for a more object-centric way of accessing RethinkDB. Unfortunately, the company behind RethinkDB, RethinkDB Inc, had to shut down; the open-sourced engineering legacy and active planning to seed a new open source community mean Compose will continue to support RethinkDB . 
RABBITMQ With the promotion of Compose RabbitMQ out of beta , we also updated it to version 3.6.5, introducing new features like Lazy Queues. RabbitMQ was another platform which got the new Let's Encrypt based SSL/TLS certificates for AMQPS connections . Compose Articles looked at how to use MongoDB and RabbitMQ together to create resilient applications and using RabbitMQ in microservices . ETCD Compose etcd entered beta in 2015 and hasn't left yet. Building a dynamic configuration service , a Write Stuff article, showed how to apply etcd in practice. We also showed you how to use etcdtool to backup and restore your etcd data and gave you a behind-the-scenes peek at how our engineering team tracked down a problem with performance degradation and fixed it. MYSQL This year, two new beta databases debuted on Compose - one of them was Compose for MySQL . This MySQL deployment is built on top MySQL 5.7.15 and makes use of InnoDB-based group replication to help deliver the high-availability and reliability that we incorporate in all out databases. Compose for MySQL is a recent arrival and, like other Compose databases, offers one-click deployments a cluster, SSL connections and a range of drivers for applications . SCYLLADB The beta release of ScyllaDB on Compose brings an Apache Cassandra compatible database to Compose for the first time. ScyllaDB presented at DataLayer and our own Nick Stott talked at Scylla Summit . Scylla CTO Avi Kivity also chatted with Compose Articles about what makes Scylla faster and lighter than Cassandra. Compose launched with Scylla 1.3 and the blog covered getting started with simple connections . We then followed that up with our better connected Scylla 1.4 release which lets you connect to all the Scylla nodes in your three node clusters. 2017, HERE WE COME There's so much in the pipeline for 2017 at Compose and we know it'll make your data life so much better. Image by Nitish Meena Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Dj Walker-Morgan is Compose's resident Content Curator, and has been both a developer and writer since Apples came in II flavors and Commodores had Pets. Love this article? Head over to Dj Walker-Morgan’s author page and keep reading. Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Customer Stories Compose Webinars Support System Status Support Documentation Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ ScyllaDB MySQL Enterprise Add-ons * Deployments AWS SoftLayer Google Cloud * Subscribe Join 75,000 other devs on our weekly database newsletter. © 2016 Compose",What a year it's been for Compose. Here are the highlights ...,Compose's 2016 — All about the database,Live,214 605,"The dplyr and tidyr packages are built to save you time when you wrangle data. Together, they provide a complete system for reshaping, transforming, and combining data sets.","The dplyr and tidyr packages are built to save you time when you wrangle data. 
Together, they provide a complete system for reshaping, transforming, and combining data sets.",Data Wrangling with dplyr and tidyr Cheat Sheet,Live,215 606,"CONNECTING POUCHDB TO CLOUDANT ON IBM BLUEMIX cloudant node.js nosql pouchDB Raymond Camden / April 29, 2015. Republished from Raymond Camden's Blog -------------------------------------------------------------------------------- So, as always, I tend to feel I'm a bit late to things. Earlier today my coworker Andy Trice was talking to me about PouchDB. PouchDB is a client-side database solution that works in all the major browsers (and Node.js) and intelligently picks the best storage system available. It is even smart enough to recognize that while Safari supports IDB, it doesn't make sense to use it and switches to WebSQL. It has a relatively simple API and best of all – it has incredibly simple sync built in. I tend to work with client-side databases with just the vanilla JavaScript APIs available to them, but honestly, after an hour or so of using PouchDB I can't see going back. (And yes, I know other solutions exist too – and I'm going to explore this area more.) Probably the slickest aspect is the sync. If you have a CouchDB server setup, you can set up automatic sync between all the database instances in seconds. For my testing, I decided to use IBM Bluemix. This blog post assumes you're following the PouchDB Getting Started guide. First, add the Cloudant NoSQL DB service to your Bluemix app. After you have added the service and restaged your app, select it, and then hit the Launch button. This fires up the Cloudant administrator where you can do – well – pretty much everything related to setting up your database. But to work with that guide at PouchDB, select Databases and then “Add New Database”, then enter todos to match the guide. OK, you're almost done. You then want to enable CORS for your Cloudant install. In the Cloudant admin, click Account and then CORS. Enable it, and then select what origin domains you want. For now, it may be easier to just allow all domains. Woot! OK, one more step. When using PouchDB and sync, they expect you to supply a connection URL. You can get this back in your Bluemix console. Select the “Show Credentials” link to expand the connection data and then copy the “url” portion. And voila – that's it. If you open your test in multiple browsers, you'll see everything sync perfectly. Remember you can also use PouchDB in Node.js, which, coincidentally, you can also host up on Bluemix, so yeah, that works out well too.
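To make that last step concrete, here is a hedged sketch of the sync call the Getting Started guide builds up to; the remote URL below is a placeholder standing in for the url value copied from the Bluemix credentials, not a real endpoint.
// local, in-browser database
var db = new PouchDB('todos');
// remote Cloudant database, using the credentials URL from Bluemix (placeholder shown)
var remoteCouch = 'https://username:password@username.cloudant.com/todos';
// keep the two in sync continuously, retrying if the connection drops
db.sync(remoteCouch, { live: true, retry: true })
  .on('change', function (info) {
    // a good place to re-render the todo list
  })
  .on('error', function (err) {
    console.log('sync error', err);
  });
With live sync in place, edits made in one browser show up in the others as soon as they reconnect.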
","A step-by-step guide to configuring Cloudant on Bluemix so that it is able to replicate to and from PouchDB - an in-browser, CouchDB-compatible database. PouchDB and Cloudant allow offline-first apps to be developed, allowing your app's users the ability to save their data even when not connected to the internet and syncing the data at a later date.",Connecting PouchDB to Cloudant on IBM Bluemix,Live,216 607,"SEVEN DATABASES IN SEVEN DAYS – DAY 1: RETHINKDB Lorna Mitchell and Matt Collins / July 28, 2016This post is part of a series of posts created by the two newest members of our Developer Advocate team here at IBM Cloud Data Services. In honour of the book Seven Databases in Seven Weeks by Eric Redmond and Jim R. Wilson, we challenged Lorna and Matt to take a new database from our portfolio every day, get it set up and working, and write a blog post about their experiences. Each post reflects the story of their day with a new database. We'll update our seven-days GitHub repo with example code as the series progresses. Meet RethinkDB and its mascot, The Thinker. * Database type: highly scalable JSON storage with real-time data feeds * Best tool for: situations where it's important to quickly update when data changes OVERVIEW RethinkDB is a database that aims to provide a performant and scalable storage solution that pleases both development and operations people. Inside, it's a document database using JSON format, is distributed by nature, and includes a user-friendly admin console for managing it.
So far, nothing particularly special but RethinkDB has a couple of tricks up its sleeve: unusually for document databases it supports joins, and it also allows you to retain a connection to a query so if any further results arrive they will instantly be pushed to the client over the same connection. RethinkDB is open source, so you can run this anywhere, although for these examples we'll make use of the cloud and grab a RethinkDB from Compose. This article covers how to get started setting up RethinkDB and connecting your application to it. We've put together a quick example using a hypothetical issue tracker and paying particular attention to the data feed updates that RethinkDB offers. GETTING SET UP Start by setting up a RethinkDB instance on Compose, which will then take a moment to deploy to the appropriate cloud (choice of AWS, SoftLayer or DigitalOcean) for you and then send you to the console. Once the database has been created, we'll start by logging into the Admin UI. CONNECTING TO THE ADMIN UI In the Compose overview for this database, you'll find some connection information – a username (defaults to admin), an Authentication Credential (a fancy way of saying password!), and a few connection strings. One of these strings is for the Admin UI. Pop that into the address bar for your browser of choice, and you should be prompted for the username and password you discovered above. Enter them into the prompt and you should see the quite lovely looking RethinkDB Admin UI in front of you! “Shine on you crazy admin” —Rethink Floyd From here we can see how our database is doing, run queries, create tables, manage indexes, and so on. This is interesting and useful, but even more interesting and useful is how we can programmatically access our database, which we'll cover in the next section. CONNECTING FROM NODE.JS There are official libraries for Node.js, Python and Ruby, and there are many more community-contributed offerings that seem to work well. So most applications will be able to easily take advantage of RethinkDB's features. There are a few pieces of information that we want to grab from the connection details screen: * Authentication Credential: this is a token that you need to click to show, and then copy * Certificate: further down the page, there's a self-signed SSL cert that you should save somewhere. Ours is in a file called cert and you'll see it referenced in our application shortly. For this example, we used Node.js, and put all the initial configuration and setup into a file named config.js, which we included in all our other scripts (example code on GitHub). Here's that file:
const fs = require('fs');
// read the self-signed certificate saved from the Compose console
const cert = new Buffer(fs.readFileSync('./cert', 'utf8'));
const connection = {
  host: 'aws-us-east-1-portal.17.dblayer.com',
  port: 11557,
  user: 'admin',
  password: 'SAHPgKzuOeFj7qu8ZaXCDjPNz4LPrCpfWEyquasjrA',
  ssl: {
    ca: cert
  }
};
module.exports = {
  connection: connection
};
Take a look at the Connection Settings screen again, and specifically at the “RethinkDB Proxy Connection strings”: the password is the Authentication Credential that you acquired earlier. Now we can test the connection by attempting to create a database — if we can successfully do this, then we know everything is working well.
Here's our create_db.js:
const config = require('./config.js');
// RethinkDB driver
const r = require('rethinkdb');
// connect to the DB
r.connect(config.connection, function(err, conn) {
  if (err) throw err;
  // create our DB
  r.dbCreate('issues').run(conn, function(err, data) {
    if (err) throw err;
    console.log('DB created');
  });
});
Pro-tip: Remember to include the self-signed SSL cert that Compose gives you. If you don't yet have Node.js, we recommend Homebrew for OS X. Then just brew install node. Treehouse also has some nice instructions. This code simply includes the config file we created earlier, creates a connection to the database, and outputs a log message if it is successful. At this point, we can start to use this connection to perform other operations. DESIGN YOUR DATABASE As an example, we'll consider a simple sort of bug tracker application, just allowing us to add issues and keep track of their status and so on. First, we'll create a table to store the issues. RethinkDB has a nice, easy web interface which you can use to create tables. You may also want to do that programmatically, so let's start by looking at the code we used to create the issues table. Here's our create_table.js:
const config = require('./config.js');
// RethinkDB driver
const r = require('rethinkdb');
// connect to the DB
r.connect(config.connection, function(err, conn) {
  if (err) throw err;
  // create our table
  r.db('issues').tableCreate('issues').run(conn, function(err, data) {
    if (err) throw err;
    console.log('Table created');
  });
});
Check in the admin interface to see your new database listed and verify that everything worked as expected. You should see your new table (but it's still empty). IMPORTING DATA Since RethinkDB is JSON-based, it's pretty happy to ingest JSON data of any kind, which is nice! There's some detailed documentation on importing data, but we generated some sample data using http://json-generator.com and simply used that to quickly give ourselves something to work with. Importing data from our application is quite simple. Here's a snippet from our application, with the data to import saved into a file named seed_data.json in the same directory. Here's create_data.js:
const config = require('./config.js');
// RethinkDB driver
const r = require('rethinkdb');
// seed data
const seed = require('./seed_data.json');
// connect to the DB
r.connect(config.connection, function(err, conn) {
  if (err) throw err;
  // seed our table with some data
  r.db('issues').table('issues').insert(seed).run(conn, function(err, data) {
    if (err) throw err;
    console.log('Seed data added');
  });
});
This is a great way to get started quickly with some data in the issues table, and it means we can move along to the fun parts: querying the data and then seeing later changes also arrive instantly. FETCHING DATA AND RECEIVING UPDATES RethinkDB has its own query language called ReQL (for the very quickest of starts, there's even an SQL to ReQL cheatsheet). Let's look at a very simple query. It fetches all records from our issues table, but here's where it gets interesting: this script will then remain connected, and output further records when new data appears.
First the code that queries the database and outputs information for each issue (fetch_all_data.js):
const config = require('./config.js');
// RethinkDB driver
const r = require('rethinkdb');
// seed data
const seed = require('./seed_data.json');
// helper function to format the output
const format = require('./format_issue.js').output;
// async
const async = require('async');
// connect to the DB
r.connect(config.connection, function(err, conn) {
  if (err) throw err;
  var actions = {
    current: function(callback) {
      // Get every issue already in the table
      r.db('issues').table('issues').run(conn, function(err, cursor) {
        if (err) throw err;
        cursor.each(function(err, issue) {
          console.log(format(issue));
        }, callback);
      });
    }
  };
  async.series(actions, function() {
    // Get every new issue as it arrives, using a changefeed
    r.db('issues').table('issues').changes().run(conn, function(err, cursor) {
      if (err) throw err;
      cursor.each(function(err, change) {
        console.log(format(change.new_val));
      });
    });
  });
});
Take a look at the output of this script (this is just the last few lines): ebd578b5-fde3-4318-bb9e-e2aaf7b43b21 ut anim sunt voluptate ex reprehenderit STATUS: closed ================================ b22d4484-2a00-472d-b5c1-20af894ed056 est sint labore tempor veniam sit STATUS: wontfix ================================ ef344e27-e809-44cd-8395-1c93490c546e in officia Lorem in pariatur labore STATUS: reopened We can leave this running in the terminal and from another window, use a script that just inserts one new row that would appear in our dataset. Below is a quick script to do that; it cheats and steals an existing row of data and repurposes it. And now, we give you create_new_row.js:
const config = require('./config.js');
// RethinkDB driver
const r = require('rethinkdb');
// some helper modules
const _ = require('underscore');
const argv = require('optimist').argv;
// seed data: take one existing row at random and reuse it
const row = _.shuffle(require('./seed_data.json'))[0];
// connect to the DB
r.connect(config.connection, function(err, conn) {
  if (err) throw err;
  // drop the id so RethinkDB generates a fresh one, then insert the repurposed row
  delete row.id;
  r.db('issues').table('issues').insert(row).run(conn, function(err, data) {
    if (err) throw err;
    console.log('New row added');
    conn.close();
  });
});
With the new row in place, take a look at what's going on in the output of our original fetch-all-the-data script: ebd578b5-fde3-4318-bb9e-e2aaf7b43b21 ut anim sunt voluptate ex reprehenderit STATUS: closed ================================ b22d4484-2a00-472d-b5c1-20af894ed056 est sint labore tempor veniam sit STATUS: wontfix ================================ ef344e27-e809-44cd-8395-1c93490c546e in officia Lorem in pariatur labore STATUS: reopened ================================ 3fc73e89-8da1-4bce-91a3-31ae897ab7b6 Lorem nisi proident ea commodo nulla STATUS: reopened CONCLUSION This ability to keep queries running and instantly ship updates when the data changes is a key feature of RethinkDB. It makes this tool a great choice for anything which needs to update in response to data, either changing prices on a ticker or notifying other users of a web-based tool that someone else made changes. RethinkDB can be used by any number of server-side languages and is available whether you want to run it on your own hardware or deploy it as-a-service.
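As one further hedged sketch (not part of the original workshop code), the same changefeed approach also works on a filtered query, which is handy when a dashboard only cares about a subset of the table; the status value below is an assumed example rather than one taken from the seed data.
// connect as before, then watch only the open issues
r.connect(config.connection, function(err, conn) {
  if (err) throw err;
  r.db('issues').table('issues')
    .filter({ status: 'open' })   // assumed status value; any ReQL filter works here
    .changes()
    .run(conn, function(err, cursor) {
      if (err) throw err;
      cursor.each(function(err, change) {
        console.log('open issue changed:', change.new_val);
      });
    });
});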
","Looking to learn the basics of cloud databases? In this series, we show them running on Compose and intro programmatic access. First up: RethinkDB.",Seven Databases in Seven Days – Day 1: RethinkDB,Live,217 615,"WEB PICKS (WEEK OF 28 DECEMBER 2016) Posted on January 3, 2017Every two weeks, we find the most interesting data science links from around the web and collect them in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to get up to speed on interesting resources . * The NIPS (Neural Information Processing Systems) 2016 conference is just past, and many people are reflecting on the many great works presenting there. See NIPS 2016 Highlights – Sebastian Ruder , Some general take aways from #NIPS2016 , 50 things I learned at NIPS 2016 , Post NIPS Reflections , All the available code repos for the NIPS 2016's top papers for what people are saying, as well as Le Cun's slides . * The great AI awakening How Google used artificial intelligence to transform Google Translate, one of its more popular services — and how machine learning is poised to reinvent computing itself. * In the race to build the best AI, there's already one clear winner As Google, Facebook, Microsoft, and Baidu take turns leapfrogging each other in artificial intelligence innovation, one company stands to profit from any outcome: Nvidia. * The World's Largest Hedge Fund Is Building an Algorithmic Model From its Employees' Brains Bridgewater wants day-to-day management—hiring, firing, decision-making—to be guided by software that doles out instructions. * Crime Prediction software joins Dubai Police Force In addition to its fleet of supercars, the Dubai Police are now enlisting the help of Crime Prediction software. * What I learned creating one chart with 24 tools Finding the best tool means thinking hard about your goals and needs. * The Most Boring/Valuable Data Science Advice “I'm going to make this quick. You do a carefully thought through analysis. You present it to all the movers and shakers at your company. Everyone loves it.
Six months later someone asks you a question you didn’t cover so you need to reproduce your analysis…” * The major advancements in Deep Learning in 2016 “In this article, we will go through the advancements we think have contributed the most (or have the potential) to move the field forward and how organizations and the community are making sure that these powerful technologies are going to be used in a way that is beneficial for all.” * US starts asking foreign travelers for their social media info Homeland Security approved the controversial proposal a few days ago. * Wall Street wants algorithms that trade based on Trump’s tweets Trump’s volatility is a market opportunity. * Tourists Vs Locals: 20 Cities Based On Where People Take Photos Tourists and locals experience cities in strikingly different ways. Great maps! * Tool AI’s want to be Agent AI’s “Tool AIs limited purely to inferential tasks will be less intelligent, efficient, and economically valuable than independent reinforcement-learning AIs learning actions over computation / data / training / architecture / hyperparameters / external-resource use.” * Building Jarvis Wondering how Zuckerberg creates an AI? “My personal challenge for 2016 was to build a simple AI to run my home — like Jarvis in Iron Man.” * A non-comprehensive list of awesome things other people did in 2016 Some people always manage to stick an ungodly amount of work in a year! * Finding MLB Anomalies with CADE “Over the Summer, while an intern at Elder Research, I learned about a very intuitive anomaly detection algorithm called CADE, or Classifier-Adjusted Density Estimation. The algorithm seemed very simple, so I wanted to try and implement it myself and try to find anomalous players in the MLB.” * A Guide to Solving Social Problems with Machine Learning “We have learned that some of the most important challenges fall within the cracks between the discipline that builds algorithms (computer science) and the disciplines that typically work on solving policy problems (such as economics and statistics). As a result, few of these key challenges are even on anyone’s radar screen.” * A Visual and Interactive Guide to the Basics of Neural Networks Simple explanation with great interactive visualizations. * Top 10 Python libraries of 2016 “Again, we try to avoid most established choices such as Django, Flask, etc. that are kind of standard nowadays.” * Hamiltonian Monte Carlo explained MCMC (Markov chain Monte Carlo) is a family of methods that are applied in computational physics and chemistry and also widely used in bayesian machine learning. * Data science and critical thinking (pdf) Some great stats and thoughts in this presentation! * Speed up your code with multidplyr “There’s nothing more frustrating than waiting for long-running R scripts to iteratively run. I’ve recently come across a new-ish package for parallel processing that plays nicely with the tidyverse: multidplyr.” * Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling “We study the problem of 3D object generation. We propose a novel framework, namely 3D Generative Adversarial Network (3D-GAN), which generates 3D objects from a probabilistic space by leveraging recent advances in volumetric convolutional networks and generative adversarial nets.” * China invents the digital totalitarian state Big data, meet big brother. 
* How we learn how you learn “In this post, we'll take a look at the science behind the Duolingo skill strength meter, which we published in an Association of Computational Linguistics article earlier this year….” * Machine learning model to production (presentation) As explained by Georg Heiler. * Anomaly Detection at Scale (presentation) Jeff Henrikson presents at the first annual O'Reilly Security Conference, in New York City, 2016.","Interesting data science links from around the web, collected in Data Science Briefings, the DataMiningApps newsletter. ",Web Picks (week of 28 December 2016),Live,218 616,"SCALING OFFLINE FIRST WITH ENVOY At Offline Camp, fellow IBM Developer Advocate Bradley Holt gave a Passion Talk on Cloudant Envoy. As I am more involved in this project than he is, Bradley asked me to write this summary of Cloudant Envoy in his place. The “one database per user” design pattern makes things very easy for an Offline First application developer. Simply create a database on the mobile device and one in the cloud and get your app to read and write from its local copy. When there is an internet connection, data can be synced between the device and the cloud. CouchDB 2.0 and IBM Cloudant are built to scale massively on the server side and each mobile device only needs to store a single user's data. We can use PouchDB for web apps and Cloudant Sync for native mobile apps on the client side and use the CouchDB replication protocol to sync without loss of data. (Figure: one database per user) The problem comes as the number of users increases: backup, reporting and change control become problematic when there are hundreds / thousands / millions of individual databases — one for each user. (Figure: database proliferation) Earlier this year, faced with this scaling problem, some IBMers armed with a flip chart, some Sharpies and a code editor, set about building something to address the scalability problems with this approach. (Figure: Envoy — from scribbles to code) Envoy is a Node.js micro-service that sits between the mobile devices and the Cloudant or CouchDB 2.0 cluster in the cloud, acting as a CouchDB replication target. It proxies the replication requests between the client and server replicas, subtly changing the documents on the way through and storing the data in a single, server-side database.
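To make the replication protocol mentioned above concrete, here is a minimal sketch, not from the original post, of asking a CouchDB-compatible server to replicate a local database to an Envoy-style target over its standard HTTP API. The URLs, database names and credentials are placeholders; in the article itself the clients doing the replicating are PouchDB and Cloudant Sync rather than a server-side request.

# Minimal sketch: trigger a one-off replication against a CouchDB-compatible
# replication target (such as Envoy) via the standard /_replicate endpoint.
# All URLs, database names and credentials below are placeholders.
import requests

COUCH_URL = "https://myusername:mypassword@myhost.example.com"

replication = {
    "source": "local-device-db",                             # database to copy from
    "target": "https://user:pass@envoy.example.com/envoy",   # Envoy acts as the target
    "continuous": False                                      # one-shot sync; True for live sync
}

resp = requests.post(COUCH_URL + "/_replicate", json=replication)
resp.raise_for_status()
print(resp.json())   # e.g. {"ok": true, ...}

Because Envoy speaks the replication protocol, anything that can replicate to CouchDB can treat it as an ordinary target; the per-user segregation happens behind that endpoint.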
(Figure: many databases replicating to a single database via Envoy) Each mobile device still has one database per user but Envoy seamlessly stores the server side data in one database — or in two databases if you count the database of users too. Having a single store of data in the cloud makes querying, backing-up and managing the data set a breeze. We think Envoy has potential, but it's early days for the project and we're looking for folks in the Offline First community to try it out, provide feedback with comments & suggestions and hopefully contribute to the codebase. It's published under the Apache-2.0 license so we'd be more than happy for folks to get involved. In future posts to the IBM Cloud Data Services Developer Center blog, we'll delve into some of the technical details but for now I'll leave you with some links: * https://github.com/cloudant-labs/envoy * https://www.npmjs.com/package/cloudant-envoy If you have any questions then you can leave comments here or ping me in the Offline First Slack community. Thanks to Bradley Holt and Maureen McElaney.","At Offline Camp, fellow IBM Developer Advocate Bradley Holt gave a Passion Talk on Cloudant Envoy. As I am more involved in this project than he is, Bradle…",Scaling Offline First with Envoy — Offline Camp,Live,219 620,"A TOUR OF THE REDIS STARS Published Oct 18, 2016 On the Redis site is a page that lists Redis clients for various languages. It's very extensive, covering clients that work with languages as diverse as emacs lisp, GNU Prolog, Haskell and C#. Throughout the list, some clients have a star next to them and these are the current recommended clients. In this tour of the starred clients, the ones that are recommended, we're going to list them in order of language popularity using the Redmonk programming language index for June 2016. Before we dive into the tour though, you may be wondering why having so many different clients for Redis is important. Redis can be thought of, in many cases, as the database glue that can hold many applications together. While disk-based databases are good as sources of reference, Redis shines in being a source of state and transient data for many systems. In a lot of cases, it's used as a cache, but that's just sharing state out to many other clients. What makes Redis work is that it has so many drivers there's no application that can be considered out of the running when working with Redis. The recommended clients are the cream of the crop, the ones that have proven themselves to be stable, mature and well maintained. Before we start though, it's worth noting there are two major styles of driver: minimalist drivers and what we call the idiomatic drivers. The minimalist drivers provide the framework to send Redis command strings and arguments and decode the Redis response. The developer using a minimalist driver will have the Redis commands documentation to hand. The idiomatic drivers instead map the Redis command set to a richer API which exposes the Redis commands in a way that's native to the language. So let's begin the tour with...
JAVASCRIPT/NODE.JS (1) - NODE-REDIS AND IOREDIS JavaScript tops the Redmonk rankings, though you'll only find JavaScript in its Node.js form in the client list. Node.js is popular with Redis client developers though; there are ten listed on the client list and two of them are recommended. Node-redis is an idiomatic driver and claims complete coverage of the Redis command set with an entirely asynchronous set of calls. That means callbacks all round for processing results though it can be promisfied with bluebird for less indented, more predictable code flow. It also has support for server events for managing the connection and subscriber events for managing pub/sub subscriptions. Handy tricks: a built-in redis.print command you can use instead of a callback to just print results. ioredis is another idiomatic and extensive Redis client with a similar set of features to node-redis, and more. For example, it works with Redis sentinels and clusters out of the box and supports ES6 Map and Hash types. Its support for Lua scripting includes a defineCommand call to simplify the process of uploading and storing Lua in the Redis server. You may wonder why two clients are recommended, and it appears so do the developers who are currently, but not rapidly, working on consolidating the features of node-redis and ioredis into a single library. Which leaves the question, which to choose currently. We'd lean towards ioredis purely because it's a more recently developed codebase. JAVA (2) - JEDIS, LETTUCE AND REDDISON Java, halfway house to the motto ""there's more than one way to do things"", has three recommended drivers, Jedis, lettuce and Redisson... Jedis is your ""small, lightweight and fast"" idiomatic Redis driver. The single Jedis instance isn't thread-safe but is usable. For thread safety, you need to create a statically stored JedisPool and fill it with Jedis instances. There's no asynchronous support but there is support for sharding over multiple Redis servers. Lettuce does claim to be thread-safe and able to service multiple threads with one connection as long as an app doesn't block the connection. It includes support for asynchronous and reactive APIs to deal with those blocking commands. It's a very idiomatic driver with a vast hierarchy of classes representing commands and results. Redisson is probably one of the most interesting of the Java clients. It sets out to create distributed data structures and services that are backed by Redis. This means you can create a Map or Set locally that is synchronized with a Redis server without marshalling the data in and out of appropriate Java objects. With a rich set of integrations, support for many services and codecs and an Apache license, it's one to look at if you want a higher level interface. Find more at the git repository . PHP (3) - PHPREDIS AND PREDIS Phpredis is a C-based extension for PHP, while Predis is a pure PHP client. Both are recommended and actively maintained. Phpredis offers better performance but usually can't be installed on hosts where the user has no shell access. Predis, as a pure PHP client doesn't have that issue, but doesn't offer the very high performance that phpredis could offer. That said, many applications don't need that high a level of performance. PYTHON (4) - REDIS-PY Pythonic access to Redis is but a ""pip install redis"" away with redis-py . 
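As a quick illustration that is not part of the original article, typical redis-py usage looks like the following sketch; the host, port and key names are placeholders.

# Minimal redis-py sketch: connect, then exercise a few core commands.
# Host, port and key names are placeholders for illustration only.
import redis

r = redis.StrictRedis(host="localhost", port=6379, db=0)

r.set("greeting", "hello")          # SET greeting hello
print(r.get("greeting"))            # b'hello' (bytes by default)

r.incr("page:views")                # INCR page:views
r.hset("user:1001", "name", "Ada")  # HSET user:1001 name Ada
print(r.hgetall("user:1001"))       # {b'name': b'Ada'}

# Pipelines batch several commands into a single round trip.
pipe = r.pipeline()
pipe.incr("page:views").expire("page:views", 3600)
print(pipe.execute())               # e.g. [2, True]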
It's notable for its extensive Readme.rst file which makes you aware of all the deviations from the Redis commands and is explicit over thread-safety issues and other pitfalls you could run into with the library. C# (5) - SERVICESTACK.REDIS AND STACKEXCHANGE.REDIS ServiceStack.Redis is a recommended driver, but that could well change as the most recent version, v4, is now a commercial product with a free tier and after hitting 6000 Redis requests in an hour ( or other limitations ), it will start generating exceptions requiring an upgrade. If you are in the market for a commercially supported Redis driver for C#, check it out, otherwise, your next stop is ... StackExchange.Redis was developed by StackExchange as a ""logical successor"" to an earlier driver called BookSleeve , StackExchange.Redis is an MIT-licensed driver which includes support for clusters, shared connections and coverage of the full redis feature set. With regular updates and a range of programming models, it's the more recommendable recommended driver for C#. RUBY (5) - REDIS-RB Redis-rb is exactly what you expect from the Ruby community - a one-to-one idiomatic mapping of Redis functionality which maintains Ruby's idioms and pragmatism. Redis-rb is all Ruby using Ruby's socket library for connections, but can also be setup to use the hiredis C driver (see the next entry) for better performance with large objects. C (9) - HIREDIS If you positively have to code in C and want the best performance from your driver, then hiredis is the foundation you are going to want to build on. Surprisingly, it's still yet to release a 1.0.0 version and the last release was a year ago. That said, as minimalist driver, it abstracts Redis communications as a redisCommand() call passing the actual command in a string so Redis command additions don't require changes to hiredis. PERL (13) - REDIS There are a lot of Perl Redis clients with different objectives, as befits the language built around there being ""more than one way to do it"", but there's only one recommended Perl client and that's Redis . If you dig into the docs , you'll find a client that idiomatically exposes the Redis API, up to but not including Redis 3.2 features. There are also modules to tie Redis Hashes and Lists into Perl Hashes and Arrays and offer sentinel support. GO (15) - RADIX AND REDIGO (AND GO-REDIS) Go is a rapidly moving ecosystem and there's an interesting mix of drivers out there. Radix is an example of the minimalist style of driver, with a non-thread-safe Redis connection which can be made safer with its own pool, sentinel and cluster implementations. Redigo is the other recommended driver, and that also offers a minimalist driver with similar features - external projects that offer the sentinel and cluster clients support. Oddly, there's no recommended idiomatic driver, so allow us to informally recommend go-redis . It's an actively developed driver with cluster and sentinel support and has interesting additional features like rate limiting and distributed locking. HASKELL (16) - HEDIS For Haskell developers, there's only one recommended and actively maintained Redis client and that's hedis . The documentation has it as a full idiomatic driver for the Redis 2.6 command set though there are at least some commands from later Redis versions implemented. It also exposes its low level API giving the user the flexibility of a minimalist driver. CLOJURE (20) - CARMINE Clojure developers have one choice in Redis client support and that's carmine . 
It's another rich idiomatic driver with support for 2.6 and later features and adds its own capabilities such as distributed locks, raw binary handling and easy message queues. Redis commands are exposed as Clojure functions, and - here's the neat part - generated by using the official Redis command reference so it's always up to date and documented. We run out of Redmonk ratings - they reasonably stop at 21 places (after ties), but there are still more recommended clients. Switching to alphabetical order we have: CRYSTAL (-) AND CRYSTAL-REDIS For Crystal developers, the crystal-redis package is the only option we know of. It has an idiomatic style API which appears to be up to but not including Redis 3.0. DART (-) AND DARTREDISCLIENT The DartRedisClient seems to have stalled in development. As a Redis client for Dart , Google's JavaScript alternative, the library reached a version 0.1 last year and there have been no commits since. That said, the 0.1 version offers an idiomatic API which returns Futures for async/non-blocking functionality. ERLANG (-) AND EREDIS Erlang developers are recommended Eredis which is a minimalist non-blocking library, with support for pipelining and auto-reconnection, but no support for sentinels or clustering. LUA (-) AND REDIS-LUA The Redis-lua library has support for commands up to, and including, Redis 2.6 in an idiomatic API but hasn't been updated since 2014. RUST (-) AND REDIS-RS The Rust libraray for Redis, Redis-rs , is being actively developed and strikes a half-way house between idiomatic and minimalist - there is some high-level functionality but it's only for commonly used features, but developers are free to fall back to using the low-level API to construct any Redis commands they wish. There are also features limited by what's currently implemented in the languge - these are detailed in the documentation . SCALA (-) AND SCALA-REDIS The scala-redis library is actively being developed and its more recent work has brought support for Redis 3.2's GEO commands among other things. It works with native Scala types and is not a wrapper around a Java client. It's a blocking client but has a pool and asynchronous futures built on top of that. And it's idiomatic. Scala developers are not short of Redis client alternatives, but scala-redis seems to cover most core requirements. AND THERE THE TOUR ENDS Hopefully, you'll come away with a good feeling about the range of languages covered by the Redis community's driver work. From strictly minimalist drivers that cover the protocol with a thin veneer of essential code, to rich idiomatic libraries designed to make Redis a natural fit to the language in use, there's a lot of ground that is covered. Remember, we've only touched all too briefly on the recommended drivers and there's a whole lot more that are not recommended but worth investigating. We'll be taking a deep dive on some of these Redis libraries in the future. We invite anyone who is knowledgable about a Redis driver to check out our Write Stuff page where you can earn cash and database credits. -------------------------------------------------------------------------------- If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com . We're happy to hear from you. 
Image by ESO.org Dj Walker-Morgan is Compose's resident Content Curator, and has been both a developer and writer since Apples came in II flavors and Commodores had Pets.",A run-down of Redis drivers for the most popular programming languages.,A tour of the Redis stars,Live,220 623,"DATA ANALYTICS HOW SMART CATALOGS CAN TURN THE BIG DATA FLOOD INTO AN OCEAN OF OPPORTUNITY August 1, 2017 | Written by: Jay Limburn Categorized: Data Analytics One of the earliest documented catalogs was compiled at the great library of Alexandria in the third century BC, to help scholars manage, understand and access its vast collection of literature. While that cataloging process represented a massive undertaking for the Alexandrian librarians, it pales in comparison to the task of wrangling the volume and variety of data that modern organizations generate. Nowadays, data is often described as an organization's most valuable asset, but unless users can easily sift through data artifacts to find the information they need, the value of that data may remain unrealized. Catalogs can solve this problem by providing an indexed set of information about the organization's data, storing metadata that describes all assets and providing a reference to where they can be found or accessed. It's not just the size and complexity of the data that makes cataloging a tough challenge: organizations also need to be able to perform increasingly complicated operations on that data at high speed, and even in real-time. As a result, technology leaders must continually find better ways to solve today's version of the same cataloging challenges faced in Alexandria all those years ago. ENTER IBM IBM's aim with Watson Data Platform is to make data accessible for anyone who uses it. An integral part of Watson Data Platform will be a new intelligent asset catalog, IBM Data Manager, a solution underpinned by a central repository of metadata describing all the information managed by the platform. Unlike many other catalog solutions on the market, the intelligent asset catalog will also offer full end-to-end capabilities around data lifecycle and governance.
Because all the elements of Watson Data Platform can utilize the same catalog, users will be able to share data with their colleagues more easily, regardless of what the data is, where it is stored, or how they intend to use it. In this way, the intelligent asset catalog will unlock the value held within that data across user groups—helping organizations use this key asset to its full potential. BREAKING DOWN SILOS With Watson Data Platform, data engineers, data scientists and other knowledge workers throughout an enterprise can search for, share and leverage assets (including datasets, files, connections, notebooks, data flows, models and more). Assets can be accessed using the Data Science Experience web user interface to analyze data, To collaborate with colleagues, users can put assets into a Project that acts as a shared sandbox where the whole team can access and utilize them. Once their work is complete, they can submit any resulting content to the catalog for further reuse by other people and groups across the organization. Rich metadata about each asset makes it easy for knowledge workers to find and access relevant resources. Along with data files, the catalog can also include connections to databases and other data sources, both on- and off-premises, giving users a full 360-degree view to all information relevant to their business, regardless of where or how it is stored. MANAGING DATA OVER TIME It’s important to look at data as an evolving asset, rather than something that stays fixed over time. To help manage and trace this evolution, IBM Data Manager will keep a complete track of which users have added or modified each asset, so that it is always clear who is responsible for any changes. SMART CATALOG CAPABILITIES FOR BIG DATA MANAGEMENT The concept of catalogs may be simple, but when they’re being used to make sense of huge amounts of constantly changing data, smart capabilities make all the difference. Here are some of the key smart catalog functionalities that we see as integral to tackling the big data challenge, and that we will be aiming to include in upcoming releases of IBM Data Manager. DATA AND ASSET TYPE AWARENESS When a user chooses to preview or view an asset of a particular type, the data and asset type awareness feature will automatically launch the data in the best viewer—such as a shaper for a dataset, or a canvas for a data flow. This will save time and boost productivity for users, optimizing discovery and making it easier to work with a variety of data types without switching tools. INTELLIGENT SEARCH AND EXPLORATION By combining metadata, machine learning-based algorithms and user interaction data, it is possible to fine-tune search results over time. Presenting users with the most relevant data for their purpose will increase usefulness of the solution the more it is used. SOCIAL CURATION Effective use of data throughout your organization is a two-way street: when users discover a useful dataset, it’s important for them to help others find it too. Users can be encouraged to engage by taking advantage of curation features, enabling them to tag, rank and comment on assets within the catalog. By augmenting the metadata for each asset, this can help the catalog’s intelligent search algorithms guide users to the assets that are most relevant to their needs. DATA LINEAGE If data is incomplete or inaccurate, utilizing it can cause more problems than it solves. 
On the other hand, if data is accurate but users do not trust it, they might not use it when it could make a real difference. In either scenario, data lineage can help. Data lineage captures the complete history of an asset in the catalog: from its original source, through all the operations and transformations it has undergone, to its current state. By exploring this lineage, users can be confident they know where assets have come from, how those assets have evolved, and whether they can be trusted. MONITORING Taking a step back to a higher-level view, monitoring features will help users keep track of overall usage of the catalog. Real-time dashboards help chief data officers and other data professionals monitor how data is being used, and identify ways to increase its usage in different areas of the organization. METADATA DISCOVERY We have already mentioned that data needs to be seen as an evolving asset—which means our catalogs must evolve with it. We plan to make it easy for users to augment assets with metadata manually; in the future, it may also be possible to integrate algorithms that can discover assets and capture their metadata automatically. DATA GOVERNANCE For many organizations, keeping data secure while ensuring access for authorized users is one of the most significant information management challenges. You can mitigate this challenge with rule-based access control and automatic enforcement of data governance policies. APIS Finally, the catalog will enable access to all these capabilities and more through a set of well-defined, RESTful APIs. IBM is committed to offering application developers easy access to additional components of Watson Data Platform, such as persistence stores and data sets. We hope that they can use our services to extend their current suite of data and analytics tools, to innovate and create smart new ways of working with data. In our next post, we'll discuss the challenges around data governance, and explore how IBM Data Manager can help you make light work of addressing them.","When used to make sense of huge amounts of constantly changing data, smart catalog capabilities can make all the difference.",How smart catalogs can turn the big data flood into an ocean of opportunity,Live,221 627,"Glynn Bird, Developer Advocate @ IBM Watson Data Platform. Views are my own etc.
AUTHENTICATION FOR CLOUDANT ENVOY APPS, PART III ADDING TWITTER AUTHENTICATION For those familiar with the Apache CouchDB ecosystem, Cloudant Envoy is a microservice that serves out your static application and behaves as a replication target for your one-database-per-user application. Simply build an application that writes data locally using PouchDB or Cloudant Sync, and Envoy will ensure that each user's data is stored in a single Cloudant database, with each user's data carefully segregated. For more background on Cloudant Envoy, I have a write-up over on Offline Camp: Scaling Offline First with Envoy (medium.com). So far I've been looking at Envoy apps that have been generating their own users. Save a document in the envoyusers database, and Envoy will use that information for subsequent authentication requests. But what if you want users to sign up with Facebook/Google/Twitter/etc? How can Envoy integrate with social media's federated login? In previous blog posts, I showed you how to add Facebook authentication to an Envoy app and then how to make the app Offline-First: Authentication for Cloudant Envoy Apps, Part I: Adding Facebook Authentication (medium.com); Authentication for Cloudant Envoy Apps, Part II: Make Your App Offline First (medium.com). For this post, we'll focus on Twitter, and I'll show you how to use Twitter as an authentication option. Let's go! PASSPORT TO THE RESCUE (AGAIN!) The PassportJS project (http://passportjs.org/) solves 95% of the problem for us, which we also demonstrated in Part 1 of this series. It has several modules, each handling authentication for a third-party partner. Although Envoy doesn't use Passport out of the box, you can create an Envoy app that does. Here's how. CREATE A TWITTER APP Visit the Twitter Application Management page and create a new Twitter app to handle authentication for you. It need only have read-only access to your users' profiles; we aren't going to be tweeting on behalf of your app's users. Once the app is created, two keys will be generated: * Consumer Key (API Key) * Consumer Secret (API Secret) Make a note of these values as we'll need to inject them into your code. CREATE AN ENVOY APP Let's create an Envoy app. These are the same steps we followed in Part 1 of this series on adding Facebook Authentication. No need to recreate this app if you're still working from the same sample application. In a new directory, type npm init and follow the on-screen prompts. This will create a template package.json file for you. Then we can add the modules we're going to need for this project: npm install --save cloudant-envoy Create some static content: mkdir public echo ""<h1>Hello World</h1>
"" > public/index.html The layout of an Envoy app is pretty simple — create an app.js : Once you’ve saved that file in the project directory you can then run the app: export COUCH_HOST=https://myusername:mypassword@myhost.cloudant.com node app.js Note: Envoy assumes the Cloudant URL will be in a COUCH_HOST environment variable. Replace myusername , mypassword and myhost with your own Cloudant account details. We now have a web server serving out our own static content which also acts as a replication target for PouchDB/CouchDB/Cloudant/Cloudant-Sync clients. ADD TWITTER AUTHENTICATION We’ll need some extra modules to handle Twitter authentication: npm install --save passport npm install --save passport-twitter npm install --save uuid npm install --save express npm install --save crypto-js Then we need to add some custom endpoints into our app to handle the authentication process. * GET /_twitter — Hitting this endpoint in your browser will bounce the user to Twitter and ask them to authenticate. * GET /_twitter/callback — After logging into the Twitter website, it will bounce the browser to this URL to allow us to access the user’s profile. We implement a getOrCreateUser function, which checks if Envoy knows about this user already. If not, a new user is created. Envoy’s user model is very simple: add a document to its users database (default name envoyusers ) to allow someone to replicate. Envoy provides some helper functions for you: * envoy.auth.getUser(userid, callback) — to fetch a user by userid * envoy.auth.newUser(userid, password, metaobject, callback) — to create a new user We need to run the app, passing in our app’s CLIENT_ID and CLIENT_SECRET environment variables we got when we created the Twitter integration: export TWITTER_API_KEY=1234567 export TWITTER_API_SECRET=abc123456 export COUCH_HOST=https://myusername:mypassword@myhost.cloudant.com node app.js Here’s the source code: HOW DO WE COMMUNICATE THE USER CREDENTIALS TO THE CLIENT SIDE? We know the ID and password of our user — not the Twitter username and password—the Envoy username and password. But how can we send that data to the client side? A simple way is to bounce the browser to a URL with the credentials in the query string: http://mypretenddomain.com/bounce.html?username=999888777&password=9886f37a-725e-4096-be67-ff2aba2acb68 We could write some client-side JavaScript to parse the query string, extract the username and password and store it locally. A safer way would be to create a single-use token and pass that in the query string. http://mypretenddomain.com/bounce.html?token=696ad23c375b4aa4acce97734fa2ea4f In this case the client-side code needs to extract the token, make a call back to the server to exchange the token for the username and password and then store the credentials locally. This is more secure as the token can be made to expire on use and have a built-in time limit. Here’s some simple client-side code to extract and decode the token, ultimately saving the user details in a local PouchDB document. Local documents are never transmitted during replication; they only remain on the device they are created: MAKING YOUR APP Now the client side app has the Envoy credentials (in a PouchDB document whose ID is _local/user ), we can set about building an app that reads and writes data to its local PouchDB database and replicates its data to and/or from your Envoy service using the credentials provided. 
var db = new PouchDB('mydb');
db.get('_local/user').then(function(loggedinuser) {
  var url = window.location.origin.replace('//', '//' + loggedinuser.username + ':' + loggedinuser.meta.password + '@');
  url += '/envoy';
  // sync live with retry, animating the icon when there's a change
  var remote = new PouchDB(url);
  db.replicate.to(remote).on('change', function(c) {
    console.log('change', c);
  });
});
(Figure: https://apps.twitter.com/) I hope you enjoyed this 3-part series. Together, we've built a static application that writes data locally using PouchDB, and you've used Cloudant Envoy to synchronize user data to a remote Cloudant database. Using PassportJS to handle authentication, your users can now sign up to use your app using their own Facebook and Twitter credentials. We also reviewed how to make an Offline-First app with Cloudant Envoy using a Progressive Web Application. Until next time! Thanks to Maureen McElaney and Mike Broberg.",I'm going to show you how to deploy a live application to a Cloudant server and add the ability to use Twitter authentication. Future articles cover Offline First & Facebook auth.,"Authentication for Cloudant Envoy Apps, Part III – IBM Watson Data Lab",Live,222 639,"Greg Filla, Product manager & Data scientist — Data Science Experience and Watson Machine Learning Dec 14 USING BIGDL IN DATA SCIENCE EXPERIENCE FOR DEEP LEARNING ON SPARK Huge thanks for the contributions from Yulia Tell and Yuhao Yang from Intel and Roland Weber from IBM in making this integration possible! Deep Learning has become one of the most popular techniques used in the field of Machine Learning in recent years. The Data Science Experience (DSX) team has been excited about deep learning since before launching last year (we have a couple blogs on this topic: DL trends, Using DL in DSX). As a data science platform, we make it easy to scale your analysis by providing a Spark cluster for all users. Whether working in notebooks or RStudio in DSX you have access to connect to this cluster to distribute workloads. Until recently, Spark batch processing was not used for Deep Learning since it required a lot of effort to optimize Spark's compute engine for training deep neural networks. This is where Intel comes in, with their big data deep learning framework called BigDL. This blog will explain what BigDL is and how it can be used in Data Science Experience. WHAT IS BIGDL? BigDL is a distributed deep learning framework for Apache Spark that was developed by Intel and contributed to the open source community for the purposes of uniting big data processing and deep learning (check out https://github.com/intel-analytics/BigDL ). Built on the highly scalable Apache Spark platform, BigDL can be easily scaled out to hundreds or thousands of servers.
In addition, BigDL uses Intel® Math Kernel Library (Intel® MKL) and parallel computing techniques to achieve very high performance on Intel® Xeon® processor-based servers (comparable to mainstream GPU performance). BigDL helps make deep learning more accessible to the big data community by allowing developers to continue using familiar tools and infrastructure to build deep learning applications. BigDL provides support for various deep learning models (for example, object detection, classification, and so on); in addition, it also lets us reuse and migrate pre-trained models (in Caffe, Torch*, TensorFlow*, and so on), which were previously tied to specific frameworks and platforms, to the general purpose big data analytics platform through BigDL. As a result, the entire application pipeline can be fully optimized to deliver significantly accelerated performance. As the following diagram shows, BigDL is implemented as a library on top of Spark, so that users can write their deep learning applications as standard Spark programs. As a result, BigDL can be seamlessly integrated with other libraries on top of Spark — Spark SQL and DataFrames, Spark ML pipelines, Spark Streaming, Structured Streaming, etc. — and can run directly on top of existing Spark or Hadoop clusters. Highlights of the BigDL v0.3.0 release Since its initial open source release in December 2016, BigDL has been used to build applications for fraud detection, recommender systems, image recognition, and many other purposes. The recent BigDL v0.3.0 release addresses many user requests, improving usability and additional new features and functionality: • New layers support • RNN encoder-decoder (sequence-to-sequence) architecture • Variational auto-encoder • 3D de-convolution • 1D convolution and pooling • Model quantization support • Quantize existing (BigDL, Caffe, Torch or TensorFlow) model • Converting float points to integer for model inference (for model size reduction & inference speedup) • Sparse tensor and layers — Efficient support of sparse data -------------------------------------------------------------------------------- BIGDL ON DSX: A PERFECT FIT Since notebooks in DSX are already executed on a Spark cluster, it is very easy to get up and running with BigDL. The only tool you need to get started is a Data Science Experience notebook. Follow the steps below to install BigDL and confirm it is working. In future posts, we will show tutorials using BigDL on DSX. Installation Guide for BigDL within IBM DSX This section was written by Roland Weber in this StackOverflow post. You can follow along with this notebook to get up and running with BigDL in DSX. If your notebooks are backed by an Apache Spark as a Service instance in DSX, installing BigDL is simple. But you have to collect some version information first. 1. Which Spark version? Currently, 2.1 is the latest supported by DSX. With Python, you can only install BigDL for one Spark version per service. 2. Which BigDL version? Currently, 0.3.0 is the latest, and it supports Spark 2.1. If in doubt, check the download page . The Spark fixlevel does not matter. With this information, you can determine the URL of the required BigDL JAR file in the Maven repository. For the example versions, BigDL 0.3.0 with Spark 2.1, the download URL is https://repo1.maven.org/maven2/com/intel/analytics/bigdl/bigdl-SPARK_2.1/0.3.0/bigdl-SPARK_2.1-0.3.0-jar-with-dependencies.jar For other versions, replace 0.3.0 and 2.1 in that URL as required. 
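As a small illustration that is not part of the original guide, the download URL can be assembled from the two version strings in Python; the helper function name is hypothetical and the pattern simply follows the Maven URL quoted above.

# Illustrative helper: build the BigDL "jar-with-dependencies" URL on Maven Central
# from a Spark version and a BigDL version, following the pattern quoted above.
def bigdl_jar_url(spark_version="2.1", bigdl_version="0.3.0"):
    base = "https://repo1.maven.org/maven2/com/intel/analytics/bigdl"
    artifact = "bigdl-SPARK_{sv}".format(sv=spark_version)
    jar = "{a}-{bv}-jar-with-dependencies.jar".format(a=artifact, bv=bigdl_version)
    return "{base}/{a}/{bv}/{jar}".format(base=base, a=artifact, bv=bigdl_version, jar=jar)

print(bigdl_jar_url())                 # the Spark 2.1 / BigDL 0.3.0 URL shown above
print(bigdl_jar_url("2.1", "0.2.0"))   # same pattern for another release; check the
                                       # download page for which combinations exist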
Note that both versions appear twice, once in the path and once in the filename. Installing for Python You need the JAR, and the matching Python package. The Python package depends only on the version of BigDL, not on the Spark version. The installation steps can be executed from a Python notebook: 1. Install the JAR. !(export sv=2.1 bv=0.3.0 ; cd ~/data/libs/ && wget https://repo1.maven.org/maven2/com/intel/analytics/bigdl/bigdl-SPARK_${sv}/${bv}/bigdl-SPARK_${sv}-${bv}-jar-with-dependencies.jar) Here, the versions of Spark (sv) and BigDL (bv) are defined as environment variables, so you can easily adjust them without having to change the URL. 2. Install the Python module. !pip install bigdl==0.3.0 | cat If you want to switch your notebooks between Python versions, execute this step once with each Python version. After restarting the notebook kernel, BigDL is ready for use. (Not) Installing for Scala If you install the JAR as described above for Python, it is also available in Scala kernels. If you want to use BigDL exclusively with Scala, better not install the JAR at all. Instead, use the %AddJar magic at the beginning of the notebook. It’s best to do this in the very first code cell, to avoid class loading issues. %AddJar https://repo1.maven.org/maven2/com/intel/analytics/bigdl/bigdl-SPARK_2.1/0.3.0/bigdl-SPARK_2.1-0.3.0-jar-with-dependencies.jar By not installing the JAR, you gain the flexibility of using different versions of Spark and BigDL in different Scala notebooks sharing the same service. As soon as you install a JAR, you’re likely to run into conflicts between that one and the one you pull in with %AddJar. -------------------------------------------------------------------------------- Hopefully after following along with those instructions you are ready to start using BigDL to train deep nets on Spark in DSX! If you prefer a Python notebook with all these steps you can copy this notebook written by the DSX development team . You can copy this notebook directly into a DSX project using the copy icon in the top right; this will let you start running the code in your Spark cluster in Data Science Experience. This notebook also gives you some code to start using the BigDL framework. Stay tuned for a follow up post showing how to train models with BigDL. If you are interested to see examples of training models for fraud detection, sentiment analysis and others with BigDL, feel free to check out BigDL model zoo at https://github.com/intel-analytics/analytics-zoo . * Machine Learning * Bigdl * Data Science * Deep Learning * Dsx One clap, two clap, three clap, forty?By clapping more or less, you can signal to us which stories really stand out. Blocked Unblock Follow FollowingGREG FILLA Product manager & Data scientist — Data Science Experience and Watson Machine Learning FollowIBM WATSON DATA PLATFORM Build smarter applications and quickly visualize, share, and gain insights * * * * Never miss a story from IBM Watson Data Platform , when you sign up for Medium. Learn more Never miss a story from IBM Watson Data Platform Get updates Get updates",This blog will explain what BigDL is and how it can be used in Data Science Experience (DSX).,Using BigDL in DSX for Deep Learning on Spark,Live,223 641,"SHIFTING SANDS A man with a hammer PAGES * Home * About Me MONDAY, DECEMBER 17, 2012 USING APPLY, SAPPLY, LAPPLY IN R This is an introductory post about using apply, sapply and lapply, best suited for people relatively new to R or unfamiliar with these functions. 
There is a part 2 coming that will look at density plots with ggplot , but first I thought I would go on a tangent to give some examples of the apply family, as they come up a lot working with R. I have been comparing three methods on a data set. A sample from the data set was generated, and three different methods were applied to that subset. I wanted to see how their results differed from one another. I would run my test harness which returned a matrix. The columns values were the metric used for evaluation of each method, and the rows were the results for a given subset. We have three columns, one for each method, and lets say 30 rows, representing 30 different subsets that the three methods were applied to. It looked a bit like this method1 method2 method3 [1,] 0.05517714 0.014054038 0.017260447 [2,] 0.08367678 0.003570883 0.004289079 [3,] 0.05274706 0.028629661 0.071323030 [4,] 0.06769936 0.048446559 0.057432519 [5,] 0.06875188 0.019782518 0.080564474 [6,] 0.04913779 0.100062929 0.102208706 We can simulate this data using rnorm , to create three sets of observations. The first has mean 0, second mean of 2, third of mean of 5, and with 30 rows. m <- matrix(data=cbind(rnorm(30, 0), rnorm(30, 2), rnorm(30, 5)), nrow=30, ncol=3) APPLY When do we use apply? When we have some structured blob of data that we wish to perform operations on. Here structured means in some form of matrix. The operations may be informational, or perhaps transforming, subsetting, whatever to the data. As a commenter pointed out, if you are using a data frame the data types must all be the same otherwise they will be subjected to type conversion. This may or may not be what you want, if the data frame has string/character data as well as numeric data, the numeric data will be converted to strings/characters and numerical operations will probably not give what you expected. Needless to say such circumstances arise quite frequently when working in R, so spending some time getting familiar with apply can be a great boon to our productivity. Which actual apply function and which specific incantion is required depends on your data, the function you wish to use, and what you want the end result to look like. Hopefully the right choice should be a bit clearer by the end of these examples. First I want to make sure I created that matrix correctly, three columns each with a mean 0, 2 and 5 respectively. We can use apply and the base mean function to check this. We tell apply to traverse row wise or column wise by the second argument. In this case we expect to get three numbers at the end, the mean value for each column, so tell apply to work along columns by passing 2 as the second argument. But let's do it wrong for the point of illustration: apply(m, 1, mean) # [1] 2.408150 2.709325 1.718529 0.822519 2.693614 2.259044 1.849530 2.544685 2.957950 2.219874 #[11] 2.582011 2.471938 2.015625 2.101832 2.189781 2.319142 2.504821 2.203066 2.280550 2.401297 #[21] 2.312254 1.833903 1.900122 2.427002 2.426869 1.890895 2.515842 2.363085 3.049760 2.027570 Passing a 1 in the second argument, we get 30 values back, giving the mean of each row. Not the three numbers we were expecting, try again. apply(m, 2, mean) #[1] -0.02664418 1.95812458 4.86857792 Great. We can see the mean of each column is roughly 0, 2, and 5 as we expected. OUR OWN FUNCTIONS Let's say I see that negative number and realise I wanted to only look at positive values. 
Let's see how many negative numbers each column has, using apply again: apply(m, 2, function(x) length(x[x<0])) #[1] 14 1 0 So 14 negative values in column one, 1 negative value in column two, and none in column three. More or less what we would expect for three normal distributions with the given means and sd of 1. Here we have used a simple function we defined in the call to apply , rather than some built in function. Note we did not specify a return value for our function. R will magically return the last evaluated value. The actual function is using subsetting to extract all the elements in x that are less than 0, and then counting how many are left are using length . The function takes one argument, which I have arbitrarily called x . In this case x will be a single column of the matrix. Is it a 1 column matrix or a just a vector? Let's have a look: apply(m, 2, function(x) is.matrix(x)) #[1] FALSE FALSE FALSE Not a matrix. Here the function definition is not required, we could instead just pass the is.matrix function, as it only takes one argument and has already been wrapped up in a function for us. Let's check they are vectors as we might expect. apply(m, 2, is.vector) #[1] TRUE TRUE TRUE Why then did we need to wrap up our length function? When we want to define our own handling function for apply, we must at a minimum give a name to the incoming data, so we can use it in our function. apply(m, 2, length(x[x<0])) #Error in match.fun(FUN) : object 'x' not found We are referring to some value x in the function, but R does not know where that is and so gives us an error. There are other forces at play here, but for simplicity just remember to wrap any code up in a function. For example, let's look at the mean value of only the positive values: apply(m, 2, function(x) mean(x[x>0])) #[1] 0.4466368 2.0415736 4.8685779 USING SAPPLY AND LAPPLY These two functions work in a similar way, traversing over a set of data like a list or vector, and calling the specified function for each item. Sometimes we require traversal of our data in a less than linear way. Say we wanted to compare the current observation with the value 5 periods before it. Use can probably use rollapply for this (via quantmod), but a quick and dirty way is to run sapply or lapply passing a set of index values. Here we will use sapply , which works on a list or vector of data. sapply(1:3, function(x) x^2) #[1] 1 4 9 lapply is very similar, however it will return a list rather than a vector: lapply(1:3, function(x) x^2) #[[1]] #[1] 1 # #[[2]] #[1] 4 # #[[3]] #[1] 9 Passing simplify=FALSE to sapply will also give you a list: sapply(1:3, function(x) x^2, simplify=F) #[[1]] #[1] 1 # #[[2]] #[1] 4 # #[[3]] #[1] 9 And you can use unlist with lapply to get a vector. unlist(lapply(1:3, function(x) x^2)) #[1] 1 4 9 However the behviour is not as clean when things have names, so best to use sapply or lapply as makes sense for your data and what you want to receive back. If you want a list returned, use lapply . If you want a vector, use sapply . DIRTY DEEDS Anyway, a cheap trick is to pass sapply a vector of indexes and write your function making some assumptions about the structure of the underlying data. Let's look at our mean example again: sapply(1:3, function(x) mean(m[,x])) [1] -0.02664418 1.95812458 4.86857792 We pass the column indexes (1,2,3) to our function, which assumes some variable m has our data. Fine for quickies but not very nice, and will likely turn into a maintainability bomb down the line. 
We can neaten things up a bit by passing our data in an argument to our function, and using the … special argument which all the apply functions have for passing extra arguments: sapply(1:3, function(x, y) mean(y[,x]), y=m) #[1] -0.02664418 1.95812458 4.86857792 This time, our function has 2 arguments, x and y . The x variable will be as it was before, whatever sapply is currently going through. The y variable we will pass using the optional arguments to sapply . In this case we have passed in m , explicitly naming the y argument in the sapply call. Not strictly necessary but it makes for easier to read & maintain code. The y value will be the same for each call sapply makes to our function. I don't really recommend passing the index arguments like this, it is error prone and can be quite confusing to others reading your code. I hope you found these examples helpful. Please check out part 2 where we create a density plot of the values in our matrix. If you are working with R, I have found this book very useful day-to-day R Cookbook (O'Reilly Cookbooks) Posted by Pete at 11:43 PM Email This BlogThis! Share to Twitter Share to Facebook Share to Pinterest Labels: apply , lapply , R , sapply7 COMMENTS: 1. Joshua Ulrich December 19, 2012 at 4:12 AMYou suggest using apply() on a matrix or data.frame, but it's very important to note that apply() always coerces its first argument to a matrix/array. This is important because a matrix/array can only contain a single atomic type, whereas a data.frame can contain columns of varying types/classes. When a data.frame is converted to a matrix, it will be converted to the highest atomic type of any of the columns of the data.frame (e.g. if the data.frame has 9 numeric columns and 1 character column, it will be converted to a 10 column character matrix). Reply Delete Replies 1. Pete December 22, 2012 at 7:54 PMHi Joshua, thank you I was not fully aware of that, and it has bitten me in the past as well. I have updated the post. Thanks for stopping by, nice to see you here! Delete 2. Reply 2. Selva Prabhakaran May 25, 2014 at 11:52 PMGreat post! Thank you so much for sharing.. For those who want to learn R Programming, here is a great new course on youtube for beginners and Data Science aspirants. The content is great and the videos are short and crisp. New ones are getting added, so I suggest to subscribe. https://www.youtube.com/watch?v=BGWVASxyow8&list=PLFAYD0dt5xCzTQHDhMPZwBoaAXWeVhZzg&index=19 Reply Delete 3. Adrian August 1, 2015 at 7:54 AMThis comment has been removed by the author. Reply Delete 4. Adrian August 1, 2015 at 7:56 AMThank you for this insightful and practical post. In the 3rd to last paragraph, you mentioned that you do not recommend passing the index argument in the way you just demonstrated. So, what method would you recommend? Reply Delete Replies 1. Pete January 7, 2016 at 1:08 AMIn general, instead of passing the indexes to use, I would try pass the data itself and let the internals of apply do the subsetting and make the function operate on that data, vs subsetting the data manually in the apply function we pass in. This isn't always possible though I know, and it is fine to pass indexes really, I am just a bit uptight about it I think. Thanks for you comment though and sorry for the delayed reply, I always have trouble posting comments on blogger! Delete 2. Reply 5. 
Daniel Maartens May 7, 2016 at 8:31 AM Hi Pete, In your second paragraph under ""using sapply and lapply"" you are trying to tell us why we might want to use sapply and lapply instead of apply because we might ""require traversal of our data in a less than linear way"" and that we also might want to ""compare the current observation with the value 5 periods before it."" However, in your subsequent answer to this problem you raised you only give us an alternative way of doing the exact same calculation you did using the apply() method (i.e. testing the means of the three rnorm-methods). Could you please provide an example highlighting how the use of sapply or lapply would enable me to traverse through data in a less than linear way and allow me to compare a current observation with a value 5 periods before it in a way that the apply() cannot? Please note that I am still a beginner in R. Thanks in advance :)","This is an introductory post about using apply, sapply and lapply, best suited for people relatively new to R or unfamiliar with these functions.","Using apply, sapply, lapply in R",Live,224 644,"A VISUAL EXPLANATION OF THE BACK PROPAGATION ALGORITHM FOR NEURAL NETWORKS Tags: Algorithms, Backpropagation, Machine Learning, Neural Networks A concise explanation of backpropagation for neural networks is presented in elementary terms, along with explanatory visualization.
By Sebastian Raschka, Michigan State University. Let's assume we are really into mountain climbing and, to add a little extra challenge, we cover our eyes this time so that we cannot see where we are or whether we have already accomplished our "objective," that is, reaching the top of the mountain. Since we can't see the path up front, we let our intuition guide us: assuming that the mountain top is the "highest" point of the mountain, we think that the steepest path leads us to the top most efficiently. We approach this challenge by iteratively "feeling" around us and taking a step in the direction of the steepest ascent; let's call it "gradient ascent." But what do we do if we reach a point where we can't ascend any further, i.e., where each direction leads downwards? At this point, we may have already reached the mountain's top, but we could just have reached a smaller plateau ... we don't know. Essentially, this is just an analogy for gradient ascent optimization (basically the counterpart of minimizing a cost function via gradient descent). However, this is not specific to backpropagation; it is just one way to minimize a convex cost function (if there is only a global minimum) or a non-convex cost function (which has local minima, like the "plateaus" that let us think we have reached the mountain's top). Using a little visual aid, we could picture a non-convex cost function with only one parameter (where the blue ball is our current location) as follows: Now, backpropagation is just back-propagating the cost over multiple "levels" (or layers). For example, if we have a multi-layer perceptron, we can picture forward propagation (passing the input signal through the network while multiplying it by the respective weights to compute an output) as follows: And in backpropagation, we "simply" backpropagate the error (the "cost" that we compute by comparing the calculated output with the known, correct target output), which we then use to update the model parameters. Pre-calculus may be a while back, but it is essentially all based on the simple chain rule that we use for nested functions.
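The chain-rule equation itself appeared as an image in the original article; a standard statement of it (a reconstruction following the usual layer-wise notation, not the original figure) is:

```latex
\frac{\partial}{\partial x} f\big(g(x)\big) \;=\; \frac{\partial f}{\partial g}\cdot\frac{\partial g}{\partial x},
\qquad
\frac{\partial L}{\partial W^{(1)}} \;=\; \frac{\partial L}{\partial a^{(2)}}\cdot
\frac{\partial a^{(2)}}{\partial a^{(1)}}\cdot
\frac{\partial a^{(1)}}{\partial W^{(1)}},
```

where L is the cost, a^(l) are the layer activations, and W^(1) the first-layer weights.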
Instead of doing this "manually," we can use computational tools (so-called "automatic differentiation"), and backpropagation is basically the "reverse" mode of this auto-differentiation. Why reverse and not forward? Because it is computationally cheaper! If we did it forward-wise, we would successively multiply large matrices for each layer until we finally multiplied a large matrix by a vector in the output layer. If we start backwards instead, we begin by multiplying a matrix by a vector, obtain another vector, and so forth. So, I'd say the beauty of backpropagation is that we are doing more efficient matrix-vector multiplications instead of matrix-matrix multiplications.
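To make the matrix-vector picture concrete, here is a minimal numpy sketch of the forward and backward passes for a tiny two-layer network with a squared-error cost. This is an illustration added for this write-up, not code from the article; the layer sizes, sigmoid activation, and learning rate are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny network: 3 inputs -> 4 hidden units -> 2 outputs
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.normal(size=3)        # one input example
t = np.array([0.0, 1.0])      # known, correct target output

# Forward propagation: multiply by the weights layer by layer
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)              # network output
cost = 0.5 * np.sum((a2 - t) ** 2)

# Backpropagation: push the error back through the layers via the chain rule.
# Note that each step is a matrix-vector product, never matrix-matrix.
delta2 = (a2 - t) * a2 * (1 - a2)         # dCost/dz2
grad_W2 = np.outer(delta2, a1)            # dCost/dW2
delta1 = (W2.T @ delta2) * a1 * (1 - a1)  # dCost/dz1
grad_W1 = np.outer(delta1, x)             # dCost/dW1

# One gradient-descent update of the model parameters
lr = 0.1
W2 -= lr * grad_W2
b2 -= lr * delta2
W1 -= lr * grad_W1
b1 -= lr * delta1
```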
",A Visual Explanation of the Back Propagation Algorithm for Neural Networks,Live,225 646,"Skip to main content IBM developerWorks / Developer Centers Sign In | Register Cloud Data Services * Services * How-Tos * Blog * Events * ConnectCONTENTS * Apache Spark * Get Started * Get Started in Bluemix * Tutorials * Load dashDB Data with Apache Spark * Load Cloudant Data in Apache Spark Using a Python Notebook * Load Cloudant Data in Apache Spark Using a Scala Notebook * Build SQL Queries * Use the Machine Learning Library * Build a Custom Library for Apache Spark * Sentiment Analysis of Twitter Hashtags * Use Spark Streaming * Launch a Spark job using spark-submit * Sample Notebooks * Sample Python Notebook: Precipitation Analysis * Sample Python Notebook: NY Motor Vehicle Accidents Analysis * BigInsights * Get Started * BigInsights on Cloud for Analysts * BigInsights on Cloud for Data Scientists * Perform Text Analytics on Financial Data * Perform Sentiment Analysis * Sample Scripts * Compose * Get Started * Create a Deployment * Add a Database and Documents * Back Up and Restore a Deployment * Enable Two-Factor Authentication * Add Users * Enable Add-Ons for Your Deployment * Compose Enterprise * Get Started * Cloudant * Get started * Copy a sample database * Create a database * Change database permissions * Connect to Bluemix * Developing against Cloudant * Intro to the HTTP API * Execute common API commands * Set up pre-authenticated cURL * Database Replication * Use cases for replication * Create a replication job * Check replication status * Set up replication with cURL * Indexes and Queries * Use the primary index * MapReduce and the secondary index * Build and query a search index * Use Cloudant Query * Cloudant Geospatial * Integrate * Create a Data Warehouse from Cloudant Data * Store Tweets Using Cloudant, dashDB, and Node-RED * Load Cloudant Data in Apache Spark Using a Scala Notebook * Load Cloudant Data in Apache Spark Using a Python Notebook * dashDB * dashDB Quick Start * Get * Get started with dashDB on Bluemix * Load data from the desktop into dashDB * Load from Desktop Supercharged with IBM Aspera * Load data from the Cloud into dashDB * Move data to the Cloud with dashDB’s MoveToCloud script * Load Twitter data into dashDB * Load XML data into dashDB * Store Tweets Using Bluemix, Node-RED, Cloudant, and dashDB * Load JSON Data from Cloudant into dashDB * Integrate dashDB and Informatica Cloud * Load geospatial data into dashDB to analyze in Esri ArcGIS * Bring Your Oracle and Netezza Apps to dashDB with Database Conversion Workbench (DCW) * Install IBM Database Conversion Workbench * Convert data from Oracle to dashDB * Convert IBM Puredata System for Analytics to dashDB * From Netezza to dashDB: It’s That Easy! 
* Use Aginity Workbench for IBM dashDB * Build * Create Tables in dashDB * Connect apps to dashDB * Analyze * Use dashDB with Watson Analytics * Perform Predictive Analytics and SQL Pushdown * Use dashDB with Spark * Use dashDB with Pyspark and Pandas * Use dashDB with R * Publish apps that use R analysis with Shiny and dashDB * Perform market basket analysis using dashDB and R * Connect R Commander and dashDB * Use dashDB with IBM Embeddable Reporting Service * Use dashDB with Tableau * Leverage dashDB in Cognos Business Intelligence * Integrate dashDB with Excel * Extract and export dashDB data to a CSV file * Analyze With SPSS Statistics and dashDB * REST API * Load delimited data using the REST API and cURL * DataWorks * Get Started * Connect to Data in IBM DataWorks * Load Data for Analytics in IBM DataWorks * Blend Data from Multiple Sources in IBM DataWorks * Shape Raw Data in IBM DataWorks * DataWorks API MOVE DATA TO THE CLOUD WITH DASHDB’S MOVETOCLOUD SCRIPTJess Mantaro / July 17, 2015See an easy way to upload files larger than 5GB to a Softlayer Swift cloudobject store using IBM dashDB’s moveToCloud script.You can also read a transcript of this video .Read the tutorial (PDF)RELATED LINKS * Load data from the Cloud into dashDB * Load data from the desktop into dashDBPlease enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM PrivacyIBM",See an easy way to upload files larger than 5GB to a Softlayer Swift cloud object store using IBM dashDB’s moveToCloud script. ,Move data to the Cloud with dashDB's MoveToCloud script,Live,226 650,Bradley spends some time discussing the different types of NoSQL databases available and why you might choose one type over another.,Bradley spends some time discussing the different types of NoSQL databases available and why you might choose one type over another.,Bradley Holt on NoSQL (Channel 9),Live,227 663,"Skip to main content IBM developerWorks / Developer Centers Sign In | Register Cloud Data Services * Services * How-Tos * Blog * Events * ConnectCONTENTS * Apache Spark * Get Started * Get Started in Bluemix * Tutorials * Load dashDB Data with Apache Spark * Load Cloudant Data in Apache Spark Using a Python Notebook * Load Cloudant Data in Apache Spark Using a Scala Notebook * Build SQL Queries * Use the Machine Learning Library * Build a Custom Library for Apache Spark * Sentiment Analysis of Twitter Hashtags * Use Spark Streaming * Launch a Spark job using spark-submit * Sample Notebooks * Sample Python Notebook: Precipitation Analysis * Sample Python Notebook: NY Motor Vehicle Accidents Analysis * BigInsights * Get Started * BigInsights on Cloud for Analysts * BigInsights on Cloud for Data Scientists * Perform Text Analytics on Financial Data * Perform Sentiment Analysis * Sample Scripts * Compose * Get Started * Create a Deployment * Add a Database and Documents * Back Up and Restore a Deployment * Enable Two-Factor Authentication * Add Users * Enable Add-Ons for Your Deployment * Compose Enterprise * Get Started * Cloudant * Get started * Copy a sample database * Create a database * Change database permissions * Connect to Bluemix * Developing against Cloudant * Intro to the HTTP API * Execute common API commands * Set up pre-authenticated cURL * Database Replication * Use cases for replication * Create a replication job * Check replication status * Set up replication with cURL * Indexes and Queries * Use the primary index * MapReduce and 
the secondary index * Build and query a search index * Use Cloudant Query * Cloudant Geospatial * Integrate * Create a Data Warehouse from Cloudant Data * Store Tweets Using Cloudant, dashDB, and Node-RED * Load Cloudant Data in Apache Spark Using a Scala Notebook * Load Cloudant Data in Apache Spark Using a Python Notebook * dashDB * dashDB Quick Start * Get * Get started with dashDB on Bluemix * Load data from the desktop into dashDB * Load from Desktop Supercharged with IBM Aspera * Load data from the Cloud into dashDB * Move data to the Cloud with dashDB’s MoveToCloud script * Load Twitter data into dashDB * Load XML data into dashDB * Store Tweets Using Bluemix, Node-RED, Cloudant, and dashDB * Load JSON Data from Cloudant into dashDB * Integrate dashDB and Informatica Cloud * Load geospatial data into dashDB to analyze in Esri ArcGIS * Bring Your Oracle and Netezza Apps to dashDB with Database Conversion Workbench (DCW) * Install IBM Database Conversion Workbench * Convert data from Oracle to dashDB * Convert IBM Puredata System for Analytics to dashDB * From Netezza to dashDB: It’s That Easy! * Use Aginity Workbench for IBM dashDB * Build * Create Tables in dashDB * Connect apps to dashDB * Analyze * Use dashDB with Watson Analytics * Perform Predictive Analytics and SQL Pushdown * Use dashDB with Spark * Use dashDB with Pyspark and Pandas * Use dashDB with R * Publish apps that use R analysis with Shiny and dashDB * Perform market basket analysis using dashDB and R * Connect R Commander and dashDB * Use dashDB with IBM Embeddable Reporting Service * Use dashDB with Tableau * Leverage dashDB in Cognos Business Intelligence * Integrate dashDB with Excel * Extract and export dashDB data to a CSV file * Analyze With SPSS Statistics and dashDB * REST API * Load delimited data using the REST API and cURL * DataWorks * Get Started * Connect to Data in IBM DataWorks * Load Data for Analytics in IBM DataWorks * Blend Data from Multiple Sources in IBM DataWorks * Shape Raw Data in IBM DataWorks * DataWorks API PUBLISH APPS THAT USE R ANALYSIS WITH SHINY AND DASHDBJess Mantaro / July 17, 2015Watch how a you can analyze dashDB data with R and publish insights with Shinyand dashDB.You can also read a transcript of this videoRELATED LINKS * Use dashDB with R * Perform market basket analysis using dashDB and R * Connect R Commander and dashDB * Analyzing with RPlease enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM PrivacyIBM",Watch how a you can analyze dashDB data with R and publish insights with Shiny and dashDB.,Publish apps that use R analysis with Shiny and dashDB,Live,228 665,"Homepage IBM Watson Follow Sign in Get started * Home * Announcements * Editorials * Tutorials * Code Spotlight * * Build with Watson * Damian Cummins Blocked Unblock Follow Following Software Developer — IBM Watson Data API Apr 9 -------------------------------------------------------------------------------- SERVERLESS DATA FLOW SEQUENCING WITH WATSON DATA API AND IBM CLOUD FUNCTIONS The complete code for this tutorial and other Watson Data API Data Flow samples can be found here . In a previous tutorial , you saw how data flows could be run one after another by polling using a simple shell script. This tutorial demonstrates how to deploy the same functionality as a serverless action. IBM Cloud Functions enable you to deploy a simple, repeatable function and run it periodically by using the alarm package. 
Again, a data flow can read data from a large variety of sources, process that data in a runtime engine using pre-defined operations or custom code, and then write it to one or more targets. For example, if you have two data flows ( data_flow_1 and data_flow_2 ) and you always want to run data_flow_2 after data_flow_1 run completes, you can write an IBM Cloud Function to check the status of the latest data_flow_1 run. If the status is completed, then the function should start a run of data_flow_2 . CREATING A NODE.JS FUNCTION First, clone this repository and run npm install to install the dependencies. Once this completes, be sure to include your project ID and the IDs of the two data flows you want to monitor and run in index.js , for example: // Parameters const projectId = 'c2254fed-404d-4905-9b8c-5102f195cc0d' const dataFlowId1 = '37bd30f0-dd3f-4052-988d-69c8fb2bf40a' // Data Flow Ref to check status of latest run const dataFlowId2 = 'd31116c7-854f-404c-9e7a-de274a8bb2d6' // Data Flow Ref to trigger run for The project ID can be retrieved from the browser URI between /projects/ and /assets in Watson Studio or Watson Knowledge Catalog when viewing the project: Similarly, the data flow ID can be retrieved from the browser URI between /refinery/ and /details in Watson Studio or Watson Knowledge Catalog when viewing the data flow:The main function is the one that will be called each time the action is invoked. The function creates a new authentication token, retrieves the latest run for dataFlowId1 , and then either creates a new dataFlowId2 run or simply returns, depending on the state and completed_date . The function is configured to run every 20 seconds so we will only start a new run for dataFlowId2 if the latest run for dataFlowId1 completed in the last 20 seconds. This is to avoid starting dataFlowId2 every time we retrieve the latest finished run for dataFlowId1 . To deploy this node.js function with IBM Cloud using the IBM Cloud Functions CLI , package it as a .zip archive, including the node_modules , index.js and package.json files. GETTING STARTED WITH IBM CLOUD FUNCTIONS CLI First, follow the instructions here to install the IBM Cloud Functions CLI. In a terminal window, upload the .zip file containing the node.js action as a Cloud Function by using the following command: bx wsk action create packageAction --kind nodejs:default action.zip . You can test the action you have just created manually by using the following command: bx wsk action invoke --blocking --result packageAction . TRIGGER: EVERY-20-SECONDS You can include a trigger that uses the built-in alarm package feed to fire events every 20 seconds. This is specified through cron syntax in the cron parameter. [Optional] The maxTriggers parameter ensures that it only fires for five minutes (15 times), rather than indefinitely. Create the trigger with the following command: bx wsk trigger create every-20-seconds --feed /whisk.system/alarms/alarm --param cron ""*/20 * * * * *"" --param maxTriggers 15 . RULE: INVOKE-PERIODICALLY This rule shows how the every-20-seconds trigger can be declaratively mapped to the packageAction. Create the rule with the following command: bx wsk rule create invoke-periodically every-20-seconds packageAction Next, open a terminal window to start polling the activation log. The console.log statements in the action will be logged here. 
You can stream them with the following command: bx wsk activation poll MONITORING LOGS Before running your data flow, you should see entries similar to the following ones: The first entry shows the IAM authorization token being obtained, the data flow run being retrieved, and the function returning because the entity.summary.completed_date is earlier than the lookback date. At this point, run dataFlowId1 from either Watson Studio or Watson Knowledge Catalog. You can do this using the Refine action for the data flow on the project assets page. The next entry is very similar, but in this case the entity.state is running, so the function returns again. In the final entry, you can see that the run for the data flow with an ID of 37bd30f0-dd3f-4052-988d-69c8fb2bf40a finished, so the data flow with an ID of d31116c7-854f-404c-9e7a-de274a8bb2d6 starts. TO SUMMARIZE… We have created a serverless action that polls the status of a data flow's most recent run and, on completion, runs another data flow. This demonstrates the ability to chain or sequence the running of data flows using Watson Data APIs in the IBM Cloud. Damian Cummins is a Cloud Application Developer with the Data Refinery and IBM Watson teams at IBM. Thanks to Cecelia Shao.","In a previous tutorial, you saw how data flows could be run one after another by polling using a simple shell script. This tutorial demonstrates how to deploy the same functionality as a serverless…",Serverless Data Flow Sequencing with Watson Data API and IBM Cloud Functions,Live,229 667,"DataMiningApps: WEB PICKS (WEEK OF 23 JANUARY 2017). Posted on January 29, 2017. Every two weeks, we find the most interesting data science links from around the web and collect them in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to get up to speed on interesting resources. * Some things I've found help reduce my stress around science “I decided to make a list of things that I've learned through hard experience do not help me with my own imposter syndrome and do help me to feel less stressed out about my science.” * The New Gold Rush? Wall Street Wants your Data If you're one of the many startups sitting on a growing data asset and trying to figure out whether you can make money selling it to Wall Street, this post is for you. * Data Readiness Levels: Turning Data from Palid to Vivid “All these problems arise before modeling even starts.
Both questions and data are badly characterised. This is particularly true in the era of Big Data, where one gains the impression that the depth of data-discussion in many decision making forums is of the form “We have a Big Data problem, do you have a Big Data solution?”, “Yes, I have a Big Data solution.” Of course in practice it also turns out to be a solution that requires Big Money to pay for because in practice no one bothered to scope the nature of the problem, the data, or the solution.” * Data Could Be the Next Tech Hot Button for Regulators “Now data — gathered in those immense pools of information that are at the heart of everything from artificial intelligence to online shopping recommendations — is increasingly a focus of technology competition. And academics and some policy makers, especially in Europe, are considering whether big internet companies like Google and Facebook might use their data resources as a barrier to new entrants and innovation.” * Poker Is the Latest Game to Fold Against Artificial Intelligence Two research groups have developed poker-playing AI programs that show how computers can out-hustle the best humans. * Why go long on artificial intelligence? We are now at the right place and time for AI to be the set of technology advancements that can help us solve challenges where answers reside in data. * 4 trends in security data science for 2017 How bots, threat intelligence, adversarial machine learning, and deep learning are impacting the security landscape. * 8 data trends on our radar for 2017 From deep learning to decoupling, here are the data trends to watch in the year ahead. * 5 Big Predictions for Artificial Intelligence in 2017 Expect to see better language understanding and an AI boom in China, among other things. * High-Speed Traders Are Taking Over Bitcoin Cryptocurreny offers fragmented market, zero transaction fees, risks include hacking thefts, Chinese government crackdown. * king – but why? “word2vec is an algorithm that transforms words into vectors, so that words with similar meaning end up laying close to each other. Moreover, it allows us to use vector arithmetics to work with analogies, for example the famous king – man + woman = queen. I will try to explain how it works, with special emphasis on the meaning of vector differences, at the same time omitting as many technicalities as possible.” * Playing with 80 Million Amazon Product Review Ratings Using Apache Spark “Back then, I was only limited to 1.2M reviews because attempting to process more data caused out-of-memory issues and my R code took hours to run. Apache Spark, which makes processing gigantic amounts of data efficient and sensible, has become very popular in the past couple years. Although data scientists often use Spark to process data with distributed cloud computing via Amazon EC2 or Microsoft Azure, Spark works just fine even on a typical laptop, given enough memory.” * Concrete AI tasks for forecasting This page contains a list of relatively well specified AI tasks designed for forecasting. Currently all entries were used in the 2016 Expert Survey on Progress in AI. Still a lot of challenges up ahead. * The Humans Working Behind the AI Curtain “Just how artificial is Artificial Intelligence? Facebook created a PR firestorm last summer when reporters discovered a human “editorial team” – rather than just unbiased algorithms – selecting stories for its trending topics section. 
The revelation highlighted an elephant in the room of our tech world: companies selling the magical speed, omnipotence, and neutrality of artificial intelligence (AI) often can’t make good on their promises without keeping people in the loop, often working invisibly in the background. So who are the people behind the AI curtain?” * Microsoft touts Deep Learning in SQL Server “Can SQL Server do Deep Learning? The response to this is enthusiastic “yes!” With the public preview of the next release of SQL Server, we’ve added significant improvements into R Services inside SQL Server including a very powerful set of machine learning functions that are used by our own product teams across Microsoft. This brings new machine learning and deep neural network functionality with increased speed, performance and scale to database applications built on SQL Server.” * Rules of Machine Learning: Best Practices for ML Engineering (pdf) This document is intended to help those with a basic knowledge of machine learning get the benefit of best practices in machine learning from around Google. Some great pieces of advice in here! * Simulation of empirical Bayesian methods (using baseball statistics) “We’re approaching the end of this series on empirical Bayesian methods, and have touched on many statistical approaches for analyzing binomial (success / total) data, all with the goal of estimating the “true” batting average of each player. There’s one question we haven’t answered, though: do these methods actually work?” * SOMBER: Self-Organizing Maps in Numpy somber (Somber Organizes Maps By Enabling Recurrence) is a collection of numpy/python implementations of various kinds of Self-Organizing Maps (SOMS), with a focus on SOMs for sequence data. * Calling Bullshit in the Age of Big Data A not-yet-official course on different aspects of bullshit in the current age. “We feel that the world has become oversaturated with bullshit and we’re sick of it. However modest, this course is our attempt to fight back.” Some great references and tidbits included on the syllabus, worth checking out! * Distributed Pandas on a Cluster with Dask Data “Dask Dataframe extends the popular Pandas library to operate on big data-sets on a distributed cluster. We show its capabilities by running through common dataframe operations on a common dataset.” * Null Hypothesis Significance Testing Never Worked “Much has been written about problems with our most-used statistical paradigm: frequentist null hypothesis significance testing (NHST), p-values, type I and type II errors, and confidence intervals. We seldom examine whether the original idea of NHST actually delivered on its goal of making good decisions about effects, given the data.” * Artificial intelligence predicts when heart will fail It correctly predicted those who would still be alive after one year about 80% of the time. The figure for doctors is 60%. * Introducing Embedding.js, a Library for Data-Driven Environments “Data and its visual presentation have become central to our understanding of the world, and yet so many visualizations prioritize bling over communication. The fear, and it is justified, is that VR will merely exacerbate the problem, unleashing new and nauseating ways to deliver empty visual calories rather than a meaningful increase in articulative power.” * Game Theory reveals the Future of Deep Learning “A disadvantage of adversarial networks are they are difficult to train. 
Adversarial learning consists in finding a Nash equilibrium to a two-player non-cooperative game. Yann Lecun, in a recent lecture on unsupervised learning, calls adversarial networks the “the coolest idea in machine learning in the last twenty years”.” * From Natural Language Processing to Artificial Intelligence (presentation) Overview of natural language processing (NLP) from both symbolic and deep learning perspectives. Covers tf-idf, sentiment analysis, LDA, WordNet, FrameNet, word2vec, and recurrent neural networks (RNNs). * R and Spark (presentation) Better support is comming, great! * Large scale data processing pipelines at trivago: a use case (presentation) Kafka is used a lot at trivago, these days, together with Impala and R. * Text Mining, the Tidy Way (presentation) January 2017 talk at rstudio::conf by Julia Silge. * AI Alignment: Why It’s Hard, and Where to Start “In this talk, I’m going to try to answer the frequently asked question, “Just what is it that you do all day long?” We are concerned with the theory of artificial intelligences that are advanced beyond the present day, and that make sufficiently high-quality decisions in the service of whatever goals they may have been programmed with to be objects of concern.” * DeepTraffic: a gamified simulation of typical highway traffic. Your task is to build a neural agent Your neural network gets to control one of the cars (displayed in red) and has to learn how to navigate efficiently to go as fast as possible. The car already comes with a safety system, so you don’t have to worry about the basic task of driving – the net only has to tell the car if it should accelerate/slow down or change lanes, and it will do so if that is possible without crashing into other cars. * The state of d3 Voronoi “ given a set of sites in a space, it partitions that space in cells — one cell for each site. Here we explore what our favourite javascript library, d3.js, allows to do with this concept.” * RL2: Fast Reinforcement Learning via Slow Reinforcement Learning (paper) “ however, the learning process requires a huge number of trials. In contrast, animals can learn new tasks in just a few trials, benefiting from their prior knowledge about the world. This paper seeks to bridge this gap. Rather than designing a “fast” reinforcement learning algorithm, we propose to represent it as a recurrent neural network (RNN) and learn it from data. In our proposed method, RL2, the algorithm is encoded in the weights of the RNN, which are learned slowly through a general-purpose (“slow”) RL algorithm.” * OpenAI announces support for GTA V in Universe but takes down page afterwards Something strange is going on regarding OpenAI’s announcement of supporting GTA V. A cached version of the page can still be accessed here , and some people have forked the code repository which was also taken offline. The eye-catching demonstration video is still up, however. * WeChat’s App Revolution “Apple Inc. isn’t taking this development lightly. It even prohibited WeChat from using the term “app” as applied to mini programs. But the challenge to the App Store might be the least of Apple’s worries. For now, WeChat is changing smartphones in China. One day soon, its impact will be felt worldwide.” * Two Google Homes are arguing on Twitch and thousands of people can’t look away A Twitch stream called Seebotschat is really taking things to the next level with a live feed of two Google Homes engaged in an absolutely hilarious war of words. 
The future of natural language is here, folks!","Every two weeks, we find the most interesting data science links from around the web and collect them in Data Science Briefings, the DataMiningApps newsletter. ",Web Picks (week of 23 January 2017),Live,230 668,"SPEED YOUR SQL QUERIES WITH SPARK SQL. Chetna Warade / August 19, 2015. It can be painful to query your enterprise Relational Database Management System (RDBMS) for useful information. You write lengthy Java code to create a database connection, send a SQL query, retrieve rows from the database tables, and convert data types. That's a lot of steps, and they all take time when users are waiting for answers. Plus, most relational databases are stuck on servers that live somewhere inside the walls of your organization, inaccessible to cloud-based apps and services. There's a more efficient and faster way to get the answers you need. You can use Apache® Spark™, the high-speed, in-memory analytics engine, to query your database instead. Not only does Spark's SQL API provide lightning-fast analytics, it also lets you access the database schema and data with only a few simple lines of code. How efficient and elegant. WHAT YOU'LL LEARN This tutorial shows you how to use Spark to query a relational database. First, we'll set up a PostgreSQL database to serve as our relational database (either on the cloud-based Compose PostgreSQL service or in a local instance). Next, you'll learn how to connect and run Spark SQL commands through the Spark shell and then through IPython Notebook. WHY SPARK? Apache® Spark™ is an open-source cluster-computing framework with in-memory processing, which enables analytic applications to run up to 100 times faster than other technologies on the market today. It helps developers be more productive and frees them to write less code. I'm so glad that IBM is committed to the Apache Spark project, investing in design and education programs to promote open source innovation. We're working hard to help developers leverage Spark to create smart, fast apps that use and deliver data wisely.
Learn more. SET UP YOUR POSTGRESQL DATABASE You can use a cloud-based Compose PostgreSQL instance (the faster, easier option), or install PostgreSQL locally and open external access to its port. OPTION 1: SET UP A CLOUD-BASED COMPOSE POSTGRESQL DATABASE This online option gives you the availability and flexibility of a cloud-based service and some neat browser-based tools. If you prefer to work locally, skip down to Option 2. 1. Download psql. With Compose, your PostgreSQL database will live in the cloud, but you need to install the psql command line tool. Go to http://www.postgresql.org/download/ and download PostgreSQL, accepting all default installation settings. 2. Sign up for a Compose PostgreSQL account. Go to https://app.compose.io/signup/ , select the PostgreSQL database option, and enter your account information. Compose asks for a credit card upon sign-up, but you get a free 30-day trial. 3. Click the Deployments button. 4. Click the deployment link to open it. 5. Click the Reveal your credentials link. You see your username and password, which you'll use in a minute. 6. Locate the Command Line, copy its contents, and keep this Compose browser window open. 7. Populate the database. 1. Open your terminal/command window and go to psql by typing the command: cd /Library/PostgreSQL/9.4/bin (if your directory/version is different, locate it first, then cd to the correct directory). 2. Connect with the following commands: type ./psql then, within quotation marks, paste in the command line you just copied, then press Enter. This will look something like: ./psql "sslmode=require host=haproxy429.aws-us-east-1-portal.3.dblayer.com port=10429 dbname=compose user=admin" 3. When prompted, enter your Compose PostgreSQL password and press Enter. You'll see: Password: psql (9.4.4) SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off) Type "help" for help. 4. Copy, paste, and run the following SQL commands: CREATE TABLE weather ( city varchar(80), temp_lo int, -- low temperature temp_hi int, -- high temperature prcp real, -- precipitation date date ); INSERT INTO weather VALUES ('San Francisco', 46, 50, 0.25, '1994-11-27'); INSERT INTO weather VALUES ('San Francisco', 43, 57, 0.0, '1994-11-29'); INSERT INTO weather VALUES ('Hayward', 54, 37, 0.25, '1994-11-29'); Shell output from psql will look like: compose=> CREATE TABLE weather ( compose(> city varchar(80), compose(> temp_lo int, -- low temperature compose(> temp_hi int, -- high temperature compose(> prcp real, -- precipitation compose(> date date compose(> ); CREATE TABLE compose=> city | temp_lo | temp_hi | prcp | date ------+---------+---------+------+------ (0 rows) compose=> INSERT 0 1 compose=> INSERT 0 1 compose=> INSERT 0 1 compose=> city | temp_lo | temp_hi | prcp | date ---------------+---------+---------+------+------------ San Francisco | 46 | 50 | 0.25 | 1994-11-27 San Francisco | 43 | 57 | 0 | 1994-11-29 Hayward | 54 | 37 | 0.25 | 1994-11-29 (3 rows) Return to Compose and, from the menu on the left, choose Browser. Click the compose database. You should see your new weather table; click it to see the values you just added. Open your deployment again (see steps 3-4) and copy the Public hostname/port. You'll use it in a minute.
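If you'd rather verify the table from Python than from psql, a minimal check with the psycopg2 driver looks roughly like this. This is a sketch added here, not part of the original tutorial; the host, port, and password are placeholders for your own Compose credentials.

```python
import psycopg2

# Placeholder connection details: use the public hostname/port and password
# from your own Compose deployment.
conn = psycopg2.connect(
    host="haproxy429.aws-us-east-1-portal.3.dblayer.com",  # example hostname from the tutorial
    port=10429,
    dbname="compose",
    user="admin",
    password="YOUR_PASSWORD",
    sslmode="require",
)

with conn, conn.cursor() as cur:
    # Read back the rows inserted above
    cur.execute("SELECT city, temp_lo, temp_hi, prcp, date FROM weather")
    for row in cur.fetchall():
        print(row)

conn.close()
```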
OPTION 2: SET UP A LOCAL POSTGRESQL DATABASE If you prefer to work with a locally-installed PostgreSQL database, follow the steps below. The local option requires some additional steps, like opening external access to the database port and restarting the database. (If you already followed the steps for Option 1 to set up a cloud-based Compose PostgreSQL database, skip ahead to the section on accessing data with Spark.) 1. Go to http://www.postgresql.org/download/ and download PostgreSQL, accepting all default installation settings. 2. Modify the pg_hba.conf file to allow external access. By default, your PostgreSQL database is accessible through port number 5432, and only localhost can access it. For this tutorial, we'll open access to external programs and machines. To do so: 1. Open your terminal or command window. 2. Sign in as the Postgres user by entering su postgres (if prompted, enter your PostgreSQL password). 3. Edit the pg_hba.conf file (located in /Library/PostgreSQL/9.4/data) using your favorite editing tool; we edited it from the command line with vi: vi pg_hba.conf 4. Add the following line to the file: host all all 0.0.0.0/0 md5 3. Restart the PostgreSQL database by running these two commands in Terminal: cd /Library/PostgreSQL/9.4/bin ./pg_ctl status -D ../data/ You see the following message: pg_ctl: server is running (PID: XXXX) /Library/PostgreSQL/9.4/bin/postgres "-D/Library/PostgreSQL/9.4/data" Take note of these additional commands: ./pg_ctl stop -D ../data/ to stop the database, and ./pg_ctl start -D ../data/ to start it again. 4. Populate the database. Launch the SQL Shell (psql) application on your machine (located at /Library/PostgreSQL/9.4/bin) to connect to the database and populate it with some data. We used the command line and entered: cd /Library/PostgreSQL/9.4/scripts/ and then: ./runpsql.sh and psql returns: Server [localhost]: Database [postgres]: Port [5432]: Username [postgres]: Password for user postgres: psql (9.4.4) Type "help" for help. Populate the table with the same CREATE TABLE and INSERT statements used in Option 1; querying the weather table then shows: city | temp_lo | temp_hi | prcp | date ---------------+---------+---------+------+------------ San Francisco | 46 | 50 | 0.25 | 1994-11-27 San Francisco | 43 | 57 | 0 | 1994-11-29 Hayward | 37 | 54 | | 1994-11-29 (3 rows) For more on working in PostgreSQL, see http://www.postgresql.org/docs/9.4/static/tutorial-table.html . ACCESS SQL DATA VIA SPARK SHELL There are two ways to work with Spark: * Access a virtual machine where Spark is installed. For this tutorial, we used a VM with Apache Spark v1.3.1 installed, hosted via VirtualBox on Mac OS X version 10.9.4. The VM image is wrapped by Vagrant, a virtual development environment configuration tool. * Or download Apache Spark and run it locally on your machine. Get it at: http://spark.apache.org/downloads.html Once you've installed Spark or know where it lives: 1. Download postgresql-9.4-1200.jdbc41.jar from https://jdbc.postgresql.org/download.html and save it to a location accessible to the Spark shell. Note its location; you'll need its path in a few minutes. 2. In your terminal or command window, open the Spark shell. * If Spark is installed locally, cd to the directory that contains the Spark shell, then run spark-shell. * If using a VM, ssh into the VM/machine where Spark is installed. Create a new Spark DataFrame object using SQLContext.load. In a command/terminal window, type: vagrant@sparkvm:~$ spark-shell --jars ./drivers/postgresql-9.4-1200.jdbc41.jar At the scala command prompt, enter the following command:
* If you're using a cloud-based Compose PostgreSQL database, retrieve the public hostname/port you copied earlier and insert it within the URL value after jdbc:postgresql:// . It should look something like this: scala> val jdbcDF = sqlContext.load("jdbc", Map("url" -> "jdbc:postgresql://haproxy425.aws-us-east-1-portal.3.dblayer.com:10425/compose?user=admin&password=XXXXXXXXXXXXXXXX", "dbtable" -> "weather")) * If you're connecting to a locally deployed PostgreSQL database, enter the following command: scala> val jdbcDF = sqlContext.load("jdbc", Map("url" -> "jdbc:postgresql://192.168.1.15:5432/postgres?user=postgres&password=postgres", "dbtable" -> "weather")) Type these commands: scala> jdbcDF.show() scala> jdbcDF.printSchema() scala> jdbcDF.filter(jdbcDF("temp_hi") > 40).show() You see Scala output that looks like this: That's it! You've accessed your PostgreSQL data via Spark SQL. ACCESS SQL DATA VIA IPYTHON NOTEBOOK In this part of the tutorial we walk through how to modify Spark's classpath and run Spark SQL commands through IPython Notebook. Note: this section assumes familiarity with Spark server installation and IPython Notebook. 1. Retrieve the complete path and name of the JDBC driver as a string value (you noted this info in the last section). 2. Locate the compute-classpath.sh file under /usr/local/bin/spark-1.3.1-bin-hadoop2.6/bin 3. Add the following line to the end of the file: appendToClasspath "/home/vagrant/drivers/postgresql-9.4-1200.jdbc41.jar" 4. Restart the VM that runs Spark. Now the IPython Notebook is ready to connect and query the sample database. 5. Launch the IPython Notebook. 6. Insert a new cell. 7. Create a new Spark DataFrame object using SQLContext.load. Tip: here you can use the same Spark commands you used at the Scala command prompt in the previous section. 8. You see Spark commands in gray boxes and, beneath each call, IPython shows the data returned.
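Since the notebook steps above reuse the spark-shell commands, here is roughly what the same session looks like in a Python (PySpark) notebook cell. This is a hedged sketch against the Spark 1.3-era DataFrame API used in this tutorial, not output from that exact environment; the Compose hostname, port, and password are placeholders.

```python
# Assumes the PostgreSQL JDBC driver is already on Spark's classpath
# (see the compute-classpath.sh step above) and that `sqlContext` is the
# SQLContext provided by the notebook environment.
url = ("jdbc:postgresql://haproxy425.aws-us-east-1-portal.3.dblayer.com:10425/compose"
       "?user=admin&password=XXXXXXXX")  # placeholder credentials

# Load the weather table into a Spark DataFrame over JDBC
jdbcDF = sqlContext.load(source="jdbc", url=url, dbtable="weather")

jdbcDF.show()          # print the rows
jdbcDF.printSchema()   # inspect the schema inferred from PostgreSQL
jdbcDF.filter(jdbcDF.temp_hi > 40).show()  # same filter as the Scala example
```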
SUMMARY Now you know how to connect Spark to a relational database and use Spark's API to perform SQL queries. Spark can also run as a cloud service, potentially unlocking your on-premises SQL data, which we'll explore more in future posts. To try these calls with another type of database, you'd follow these same steps, but download the JDBC driver supported by your RDBMS. Tagged: PostgreSQL / Spark / SQL",Get faster queries and write less code too. Learn how to use Spark SQL to query your relational database. Follow this tutorial and see how to query a cloud-based Compose PostgreSQL instance or a local PostgreSQL database.,Speed your SQL Queries with Spark SQL,Live,231 672,"SELF-SERVICE DATA PREPARATION WITH IBM DATA REFINERY. Carmen Ruppach, Offering Manager for Data Refinery on Watson Data Platform at IBM. Nov 14. If you are like most data scientists, you are probably spending a lot of time cleansing, shaping, and preparing your data before you can actually start with the more enjoyable part of building and training machine learning models. As a data analyst, you might face similar struggles obtaining data in the format you need to build your reports. In many companies, data scientists and analysts need to wait for their IT teams before they can get access to cleaned data in a consumable format. IBM Data Refinery addresses this issue. It provides an intuitive self-service data preparation environment where you can quickly analyze, cleanse, and prepare data sets. It is a fully managed cloud service, available in open beta now. Analyze and prepare your data With IBM Data Refinery, you can interactively explore your data and use a wide range of transformations to cleanse and transform data into the format you need for analysis. You can use a simple point-and-click interface for selecting and combining a wide range of built-in operations, such as filtering, replacing, and deriving values. It is also possible to quickly remove duplicates, split and concatenate values, and choose from a comprehensive list of text and math operations. Interactive data exploration and preparation. If you prefer to code, in IBM Data Refinery you can directly enter R commands via R libraries such as dplyr. We provide code templates and in-context documentation to help you become productive with the R syntax more quickly. Code templates to help users with R syntax. If you're not satisfied with the shaping results, you can easily undo and change operations in the Steps sidebar. The interactive user interface works on a subset of the data to give you a faster preview of the operations and results. Once you're happy with the sample output, you can apply the transformations to the entire data set and save all transformation steps in a data flow. You can repeat the data flow later and track the changes that were applied to your data. To accelerate job execution, Apache Spark is used as the execution engine. Data profiling and visualization Data shaping is an iterative and time-consuming process. In a traditional data science workflow, you might use one tool to apply various transformations to your data set, and then load the data into another tool to visualize and evaluate the results. Over many cycles, this continual tool hopping can become frustrating. IBM Data Refinery soothes the pain by integrating both data transformations and visualizations in a single interface, so you can move between views with a simple click. You can use the Profile tab to view descriptive statistics of your data columns in order to better understand the distribution of values. You can continue to apply transformations, and the corresponding profile information adjusts automatically. On the Visualization tab you can select a combination of columns to build charts using Brunel (an open source visualization library).
IBM Data Refinery automatically suggests appropriate plots, and you can choose between 12 pre-defined chart types. You can adjust the appearance of the charts using Brunel syntax. Connecting to data wherever it resides IBM Data Refinery comes with a comprehensive set of 30 prebuilt data connectors, so you can set up connections to a wide range of commonly used on-premises and cloud data stores. You can connect to IBM as well as non-IBM services. If your data service is hosted on IBM Cloud (formerly IBM Bluemix), you can directly access the data service instance from IBM Data Refinery. Once you specify a connection and connect the data object to your data, you can start to analyze and refine your data wherever it resides. Try out IBM Data Refinery! Sign up for free at: https://www.ibm.com/cloud/data-refinery Tags: Data Science, Data Visualization, Data Analysis, Data Refinery","If you are like most data scientists, you are probably spending a lot of time to cleanse, shape and prepare your data before you can actually start with the more enjoyable part of building and…",Self-service data preparation with IBM Data Refinery,Live,232 676,"Stats and Bots: BAYESIAN NONPARAMETRICS. AN INTRODUCTION TO THE DIRICHLET PROCESS AND ITS APPLICATIONS. Vadim Smolyakov, passionate about data science and machine learning, https://github.com/vsmolyakov . Oct 12. Bayesian Nonparametrics is a class of models with a potentially infinite number of parameters. The high flexibility and expressive power of this approach enables better data modelling compared to parametric methods. Bayesian Nonparametrics is used in problems where a dimension of interest grows with data, for example, in problems where the number of features is not fixed but allowed to vary as we observe more data. Another example is clustering, where the number of clusters is automatically inferred from data. The Statsbot team asked a data scientist, Vadim Smolyakov, to introduce us to Bayesian Nonparametric models. In this article, he describes the Dirichlet process along with associated models and links to their implementations. INTRODUCTION: DIRICHLET PROCESS K-MEANS Bayesian Nonparametrics are a class of models for which the number of parameters grows with data. A simple example is non-parametric K-means clustering [1]. Instead of fixing the number of clusters K, we let the data determine the best number of clusters. By letting the number of model parameters (cluster means and covariances) grow with data, we are better able to describe the data as well as generate new data given our model. Of course, to avoid over-fitting, we penalize the number of clusters K via a regularization parameter which controls the rate at which new clusters are created.
Thus, our new K-means objective becomes: In the figure above, we can see the non-parametric clustering, aka Dirichlet-Process (DP) K-Means applied to the Iris dataset. The strength of regularization parameter lambda (right), controls the number of clusters created. Algorithmically, we create a new cluster, every time we discover that a point (x_i) is sufficiently far away from all the existing cluster means: The resulting update is an extension of the K-means assignment step: we reassign a point to the cluster corresponding to the closest mean or we start a new cluster if the squared Euclidean distance is greater than lambda. By creating new clusters for data points that are sufficiently far away from the existing clusters, we eliminate the need to specify the number of clusters K ahead of time. Dirichlet process K-means eliminates the need for expensive cross-validation in which we sweep a range of values for K in order to find the optimum point in the objective function. For an implementation of the Dirichlet process K-means algorithm see the following github repo . DIRICHLET PROCESS The Dirichlet process (DP) is a stochastic process used in Bayesian nonparametric models [2]. Each draw from a Dirichlet process is a discrete distribution. For a random distribution G to be distributed according to a DP, its finite dimensional marginal distributions have to be Dirichlet distributed. Let H be a distribution over theta and alpha be a positive real number. We say that G is a Dirichlet process with base distribution H and concentration parameter alpha if for every finite measurable partition A1,…, Ar of theta we have: Where Dir is a Dirichlet distribution defined as: The Dirichlet distribution can be visualized over a probability simplex as in the figure below. The arguments to the Dirichlet distribution (x1, x2, x3) can be interpreted as pseudo-counts. For example, in the case of (x1, x2, x3) = (2, 2, 2) the Dirichlet distribution (left) has high probability near the middle, in comparison to the (2, 2, 10) case where it concentrates around one of the corners. In the case of (10, 10, 10) we have more observations, and the Dirichlet distribution concentrates more in the middle (since equal number of counts are observed in this case). The base distribution H is the mean of the DP: E[G(A)] = H(A), whereas the concentration parameter is the inverse variance: VAR[G(A)] = H(A)[1-H(A)] / (1+alpha). Thus, the larger the alpha, the smaller the variance and the DP will concentrate more of its mass around the mean as shown in the figure below [3]. STICK-BREAKING CONSTRUCTION We have seen the utility of Bayesian Nonparametric models is in having a potentially infinite number of parameters. We also had a brief encounter with the Dirichlet process that exhibits a clustering property that makes it useful in mixture modeling where the number of components grows with data. But how do we generate a mixture model with an infinite number of components?The answer is a stick-breaking construction [4] that represents draws G from DP(alpha, H) as a weighted sum of atoms (or point masses). It is defined as follows: The mixture model G consists of an infinite number of weights (pi_k) and mixture parameters (theta_k). The weights are generated by first sampling beta_k from Beta(1, alpha) distribution, where alpha is the concentration parameter and then computing pi_k as in the expression above, while mixture parameters theta_k are sampled from the base distribution H. 
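The formulas in this article were rendered as images in the original post and did not survive extraction. The standard forms they refer to (the DP-means objective, the finite-dimensional DP marginals, the Dirichlet density, and the stick-breaking construction) are reproduced below as a reconstruction, following the usual definitions in references [1]-[4]:

```latex
% DP-means objective: the K-means cost plus a penalty on the number of clusters
\min_{\{\ell_k\},\{\mu_k\},K}\;\sum_{k=1}^{K}\sum_{x_i\in\ell_k}\lVert x_i-\mu_k\rVert^2 \;+\; \lambda K,
\qquad\text{start a new cluster if }\min_k\lVert x_i-\mu_k\rVert^2>\lambda .

% Dirichlet process: finite-dimensional marginals are Dirichlet distributed
\big(G(A_1),\dots,G(A_r)\big) \sim \mathrm{Dir}\big(\alpha H(A_1),\dots,\alpha H(A_r)\big),
\qquad
\mathrm{Dir}(\pi\mid\alpha_1,\dots,\alpha_r)=\frac{\Gamma\!\big(\sum_j\alpha_j\big)}{\prod_j\Gamma(\alpha_j)}\prod_{j=1}^{r}\pi_j^{\alpha_j-1}.

% Stick-breaking construction of G ~ DP(alpha, H)
\beta_k\sim\mathrm{Beta}(1,\alpha),\qquad
\pi_k=\beta_k\prod_{l=1}^{k-1}(1-\beta_l),\qquad
\theta_k\sim H,\qquad
G=\sum_{k=1}^{\infty}\pi_k\,\delta_{\theta_k}.
```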
We can visualize the stick-breaking construction as in the figure below: Notice that we start with a stick of unit length (left) and in each iteration we break off a piece of length pi_k. The length of the piece that we break off is determined by the concentration parameter alpha. For alpha=5 (middle) the stick lengths are longer and as a result there are fewer significant mixture weights. For alpha=10 (right) the stick lengths are shorter and therefore we have more significant components. Thus, alpha determines the rate of cluster growth in a non-parametric model. In fact, the number of clusters created is proportional to alpha x log(N) where N is the number of data points. DIRICHLET PROCESS MIXTURE MODEL (DPMM) A Dirichlet process mixture model (DPMM) belongs to a class of infinite mixture models in which we do not impose any prior knowledge on the number of clusters K. DPMM models learn the number of clusters from the data using a nonparametric prior based on the Dirichlet process (DP). Automatic model selection leads to computational savings of cross validating the model for multiple values of K. Two equivalent graphical models for a DPMM are shown below: Here, x_i are observed data points and with each x_i we associate a label z_i that assigns x_i to one of the K clusters. In the left model, the cluster parameters are represented by pi (mixture proportions) and theta (cluster means and covariances) with associated uninformative priors (alpha and lambda). For ease of computation, conjugate priors are used such as a Dirichlet prior for mixture weights and Normal-Inverse-Wishart prior for a Gaussian component. In the right model, we have a DP representation of DPMM where the mixture distribution G is sampled from a DP (alpha, H) with concentration parameter alpha and base distribution H. There are many algorithms for learning the Dirichlet process mixture models based on sampling or variational inference. For a Gibbs sampler implementation of DPMMs with Gaussian and Discrete base distribution, have a look at the following code . The figure above shows DPMM clustering results for a Gaussian distribution (left) and Categorical distribution (right). On the left, we can see the ellipses (samples from posterior mixture distribution) of the DPMM after 100 Gibbs sampling iterations. The DPMM model initialized with 2 clusters and a concentration parameter alpha of 1, learned the true number of clusters K=5 and concentrated around cluster centers. On the right, we can see the results of clusters of Categorical data, in this case a DPMM model was applied to a collection of NIPS articles. It was initialized with 2 clusters and a concentration parameter alpha of 10. After several Gibbs sampling iterations, it discovered over 20 clusters, with the first 4 shown in the figure. We can see that the word clusters have similar semantic meaning within each cluster and the cluster topics are different across clusters. HIERARCHICAL DIRICHLET PROCESS (HDP) The hierarchical Dirichlet process (HDP) is an extension of DP that models problems involving groups of data especially when there are shared features among the groups. The power of hierarchical models comes from an assumption that the features among groups are drawn from a shared distribution rather than being completely independent. Thus, with hierarchical models we can learn features that are common to all groups in addition to the individual group parameters. 
In HDP, each observation within a group is a draw from a mixture model and mixture components are shared between groups. In each group, the number of components is learned from data using a DP prior. The HDP graphical model is summarized in the figure below [5]: Focusing on HDP formulation in the figure on the right, we can see that we have J groups where each group is sampled from a DP: Gj ~ DP(alpha, G0) and G0 represents shared parameters across all groups which in itself is modeled as a DP: G0 ~ DP(gamma, H). Thus, we have a hierarchical structure for describing our data. There exists many ways for inferring the parameters of hierarchical Dirichlet processes. One popular approach that works well in practice and is widely used in the topic modelling community is an online variational inference algorithm [6] implemented in gensim . The figure above shows the first four topics (as a word cloud) for an online variational HDP algorithm used to fit a topic model on the 20newsgroups dataset . The dataset consists of 11,314 documents and over 100K unique tokens. Standard text pre-processing was used, including tokenization, stop-word removal, and stemming. A compressed dictionary of 4K words was constructed by filtering out tokens that appear in less than 5 documents and more than 50% of the corpus. The top-level truncation was set to T=20 topics and the second level truncation was set to K=8 topics. The concentration parameters were chosen as gamma=1.0 at the top-level and alpha=0.1 at the group level to yield a broad range of shared topics that are concentrated at the group level. We can find topics about autos, politics, and for sale items that correspond to the target labels of the 20newsgroups dataset. HDP HIDDEN MARKOV MODELS The hierarchical Dirichlet process (HDP) can be used to define a prior distribution on transition matrices over countably infinite state spaces. The HDP-HMM is known as an infinite hidden Markov model where the number of states is inferred automatically. The graphical model for HDP-HMM is shown below: In a nonparametric extension of HMM, we consider a set of DPs, one for each value of the current state. In addition, the DPs must be linked because we want the same set of next states to be reachable from each of the current states. This relates directly to HDP, where the atoms associated with state-conditional DPs are shared. The HDP-HMM parameters can be described as follows: Where the GEM notation is used to represent stick-breaking. One popular algorithm for computing the posterior distribution for infinite HMMs is called beam sampling and is described in [7]. DEPENDENT DIRICHLET PROCESS (DDP) In many applications, we are interested in modelling distributions that evolve over time as seen in temporal and spatial processes. The Dirichlet process assumes that observations are exchangeable and therefore the data points have no inherent ordering that influences their labelling. This assumption is invalid for modelling temporal and spatial processes in which the order of data points plays a critical role in creating meaningful clusters. The dependent Dirichlet process (DDP), originally formulated by MacEachern, provides a nonparametric prior over evolving mixture models. A construction of the DDP built on the Poisson process [8] led to the development of the DDP mixture model as shown below: In the graphical model above we see a temporal extension of the DP process in which a DP at time t depends on the DP at time t-1. 
This time-varying DP prior is capable of describing and generating dynamic clusters with means and covariances changing over time. CONCLUSION In Bayesian Nonparametric models the number of parameters grows with data. This flexibility enables better modeling and generation of data. We focused on the Dirichlet process (DP) and key applications such as DP K-means (DP-means), Dirichlet process mixture models (DPMMs), hierarchical Dirichlet processes (HDPs) applied to topic models and HMMs, and dependent Dirichlet processes (DDPs) applied to time-varying mixtures. We looked at how to construct nonparametric models using stick-breaking and examined some of the experimental results. To better understand Bayesian Nonparametric models, I encourage you to read the literature mentioned in the references and experiment with the code linked throughout the article on challenging datasets! REFERENCES [1] B. Kulis and M. Jordan, “Revisiting k-means: New Algorithms via Bayesian Nonparametrics”, ICML, 2012. [2] E. Sudderth, “Graphical Models for Visual Object Recognition and Tracking”, PhD thesis (Ch. 2.5), 2006. [3] A. Rochford, “Dirichlet Process Mixture Model in PyMC3”. [4] J. Sethuraman, “A Constructive Definition of Dirichlet Priors”, Statistica Sinica, 1994. [5] Y. Teh, M. Jordan, M. Beal and D. Blei, “Hierarchical Dirichlet Processes”, JASA, 2006. [6] C. Wang, J. Paisley and D. Blei, “Online Variational Inference for the Hierarchical Dirichlet Process”, JMLR, 2011. [7] J. Van Gael, Y. Saatci, Y. Teh and Z. Ghahramani, “Beam Sampling for the Infinite Hidden Markov Model”, ICML, 2008. [8] D. Lin, W. Grimson and J. W. Fisher III, “Construction of Dependent Dirichlet Processes Based on Compound Poisson Processes”, NIPS, 2010.",An introduction to Bayesian Nonparametrics: the Dirichlet process along with associated models and links to their implementations.,Bayesian Nonparametric Models – Stats and Bots,Live,233 680,"John Thomas, IBM Distinguished Engineer. Dec 20 3 SCENARIOS FOR MACHINE LEARNING ON MULTICLOUD Wikimedia Commons photo. More and more cloud-computing experts are talking about “multicloud”.
The term refers to an architecture that spans multiple cloud environments in order to take advantage of different services, different levels of performance, security, or redundancy, or even different cloud vendors. But what sometimes gets lost in these discussions is that multicloud is not always public cloud. In fact, it’s often a combination of private and public clouds. As machine learning (ML) continues to pervade enterprise environments, we need to understand how to make ML practical on multicloud — including those architectures that span the firewall. Let’s look at three possible scenarios. SCENARIO 1: TRAIN WITH ON-PREM DATA, DEPLOY ON CLOUD It often happens that the data science team needs to build and train an ML model on sensitive customer data even though the model itself will be deployed on a public cloud. Data gravity and security issues mean that the model needs to be trained behind the firewall, where the data lives. However, the model may need to be invoked by cloud-native applications. Concerns about the latency for scoring calls mean that the model should be deployed close to the consuming app — near the edge of the network, outside the firewall. SCENARIO 2: TRAIN ON SPECIALIZED HARDWARE, DEPLOY ON SYSTEMS OF RECORD Deep Learning models as well as some types of classic ML models can benefit from significant acceleration using specialized hardware. For example, a data science team might decide to build and train the model on specialized hardware like a PowerAI machine, which consists of Power processors coupled to GPUs through high-speed NVLink connections. The PowerAI machine is designed to significantly speed up the training process, but the model itself may need to be consumed in a system of record like an on-premises z System. SCENARIO 3: TRAIN ON CLOUD WITH PUBLIC DATA, DEPLOY ON-PREM The third scenario is becoming increasingly common with the increased availability — and increased quality — of public data. Imagine a financial firm doing arbitrage on agricultural commodities. The data science team gathers a variety of publicly available data including weather and climate data, crop yield data, currency data, and more. Because the data is high-volume and non-proprietary, they aggregate it on a public cloud where they also train their ML model. They pull down the latest version of the model and integrate it within a proprietary application that the firm has developed to predict the prices of the commodities they trade. IBM’S APPROACH Each of these scenarios calls for a fit-for-purpose, multicloud architecture for flexibly training, deploying, and consuming the machine learning models. IBM takes an enterprise approach by making our Data Science Experience (DSX) platform available both on-prem and in the cloud — with intuitive interfaces designed to let users easily move from one to the other. With the same REST APIs, you can save, publish, and consume models across environments — on the mainframe, on a private cloud, or on the public cloud, including on non-IBM public clouds , like AWS and Azure. These two videos demonstrate how easy this is: AWS / Azure . A Kubernetes-based implementation of the DSX platform gives you the flexibility to run DSX Local within a variety of infrastructure options. For example, you can stand up a multi-node cluster with two separate infrastructure vendors, and then build and train models wherever it’s most convenient, and move your models from one vendor infrastructure to the other. 
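Concretely, the consumption pattern in all three scenarios comes down to an application making an HTTP scoring call to wherever the model happens to be deployed. The sketch below is a rough, hypothetical illustration in Python using the requests library; the endpoint URL, the token handling, and the payload shape are placeholders and not the actual Watson Machine Learning API.

import requests

# Hypothetical scoring endpoint of a deployed model; in practice the URL
# and the authentication scheme come from the deployment environment.
scoring_url = 'https://example.com/v1/deployments/churn-model/score'
api_token = 'replace-with-a-real-token'

payload = {'fields': ['tenure_months', 'plan'],
           'values': [[14, 'basic']]}

response = requests.post(scoring_url,
                         json=payload,
                         headers={'Authorization': 'Bearer ' + api_token},
                         timeout=10)
response.raise_for_status()
print(response.json())  # e.g. predicted label and probabilities

Whether the endpoint lives on a public cloud or behind the firewall only changes the URL and the credentials, which is what makes this pattern a natural fit for multicloud deployments.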
In DSX, each deployed model gets an external and internal end point. To invoke the model, simply use a REST API call for the end point. You can build and train the model on-prem and deploy the model to the cloud, where an external application like a chatbot can consume the model by making a REST API call to the particular end point. When multicloud flexibility lets you pick and choose the cloud environments that best fit your needs, you can align with the principle of data gravity and let your consumption channels dictate where you deploy the machine learning models that will transform your organization. Visit us to learn more about the Data Science Experience. * Cloud Computing * Machine Learning * Scenario * Scenarioplanning One clap, two clap, three clap, forty?By clapping more or less, you can signal to us which stories really stand out. 12 Blocked Unblock Follow FollowingJOHN THOMAS IBM Distinguished Engineer. #Analytics, #Cognitive, #Cloud, #MachineLearning, #DataScience. Chess, Food, Travel (60+ countries). Tweets are personal opinions. FollowINSIDE MACHINE LEARNING Deep-dive articles about machine learning and data. Curated by IBM Analytics. * 12 * * * Never miss a story from Inside Machine learning , when you sign up for Medium. Learn more Never miss a story from Inside Machine learning Get updates Get updates",More and more cloud-computing experts are talking about “multicloud”. The term refers to an architecture that spans multiple cloud environments in order to take advantage of different services…,3 Scenarios for Machine Learning on Multicloud,Live,234 683,"Compose The Compose logo Articles Sign in Free 30-day trialHOW TO ENABLE A REDIS CACHE FOR POSTGRESQL WITH ENTITY FRAMEWORK 6 Published Jul 17, 2017 redis postgresql c# How to enable a Redis cache for PostgreSQL with Entity Framework 6Caching a database can be a chore but in this Write Stuff article, Mariusz Bojkowski shows how easy it can be to add a Redis cache to your PostgreSQL database if you are using Entity Framework 6 . Database caching is a commonly used technique to improve scalability. By offloading database work to other, faster stores it can also help improve the availability of the data too. Often, though, that caching comes at the cost of hardwired code in the application to check the cache first before the database. But what if we could do it cheaply and transparently to the application? Let's try to leverage C# and the features of Entity Framework 6 to do all the heavy lifting. I’ll show how to use PostgreSQL database with the framework and how to add transparent caching using Redis database. In this tutorial, I’ll create simple Books table and a console application that will get the data from the table. Next, I’ll upgrade the application to use caching. I’ll be using Visual Studio 2017. The full application source is available on GitHub . PREPARING THE POSTGRESQL DATABASE First, you need to create PostgreSQL database using the tools or provider of your choice. Next, let’s create a sample database table of books. Connect to the PostgreSQL database and execute the following create statement. CREATE TABLE ""Books"" ( ""Id"" SERIAL NOT NULL, ""Title"" VARCHAR(50) NOT NULL, ""Author"" VARCHAR(50) NOT NULL, PRIMARY KEY (""Id"") ); Please remember that all identifiers (table names, column names) are folded to lower case in a PostgreSQL database. To change it, make sure you use double quotation marks in the table name and column names. 
This is required as it will simplify Book entity mapping to properties of the C# model class. CREATE ENTITY FRAMEWORK APPLICATION Once the database table is ready, create a new console application. Open Visual Studio and click File menu, then New – Project. From the dialog box, choose Installed – Templates – Visual C# – Windows Classic Desktop . Chose Console App (.NET Framework) , then provide a name (I typed RedisCacheForPostgre ) and location. Next, let’s add PostgreSQL Entity Framework provider – add the latest version of Npgsql.EntityFramework NuGet package. It will also install Entity Framework 6 NuGet package as it’s one of the dependencies. Please note that at the moment of writing this article the latest version of the Npgsql provider (2.2.7) references Entity Framework version 6.0.0 (not the latest) and the version 6.0.0 will be installed. We will upgrade few paragraphs below. CONFIGURE ENTITY FRAMEWORK ADD POSTGRESQL CONNECTION STRING Open App.config file and add connectionStrings section as in the example below. Please keep configSections as the first child element of configuration node – it’s a strict .NET requirement. Otherwise, the application will crash at runtime. (...) password=secret"" providerName=""Npgsql"" /> (...) Please note that there is providerName attribute in the connection string definition pointing to the PostgreSQL provider (Npgsql). DEFINE BOOKS ENTITY Add a new folder to the project and give it ‘Entities’ name. Next, add Book class to the folder. It will reflect books entities from the database. using System.ComponentModel.DataAnnotations; using System.ComponentModel.DataAnnotations.Schema; namespace RedisCacheForPostgre.Entities { [Table(""Books"", Schema = ""public"")] public class Book { [Key] public int Id { get; set; } public string Title { get; set; } public string Author { get; set; } } } There is a Table attribute added to the class that defines the database table name. Note the schema parameter – by default Entity Framework uses dbo schema. PostgreSQL uses public schema on the other hand. Also, the Id property is decorated with Key attribute to instruct Entity Framework that its primary key column. DEFINE THE DATABASE CONTEXT Add PostgreContext class to the Entities folder. The class should inherit from System.Data.Entity.DbContext . It will be the main interface for accessing the database. using System.Data.Entity; namespace RedisCacheForPostgre.Entities { public class PostgreContext : DbContext { public PostgreContext() : base(nameOrConnectionString: ""PostgreSQL"") { } public DbSet ). This way Entity Framework will mark them as new rows. Finally, the SaveChanges method adds the new rows to the database. You can query the database to confirm that the rows have been added. QUERY POSTGRESQL DATABASE We have the sample data in the database, so let’s query it in the application. Add a PrintBooks method to Program class. using RedisCacheForPostgre.Entities; using System; using System.Linq; namespace RedisCacheForPostgre { public class Program { public static void Main(string[] args) { //InsertSampleData(); PrintBooks(); } private static void PrintBooks() { using (var context = new PostgreContext()) { var books = context.Book.ToList(); foreach(var book in books) { Console.WriteLine($"" '{book.Title}' by {book.Author}""); } } } } } Again, we create an instance of PostgreContext. Then, we get a list of all books by calling the Book.ToList method. Finally, the list is printed to the console. 
ADD REDIS CACHING ADD REDIS CONNECTION STRING Edit App.config and insert new connection string to the Redis database. (...) password=secret"" providerName=""Npgsql"" /> (...) In order to easily access the connection string later we have to add a reference to System.Configuration assembly – right click the project and choose Add - Reference from the context menu. Next, select Assemblies - Framework , find System.Configuration and check the checkbox next to it. ADD CACHE SUPPORT Add EFCache.Redis NuGet package that extends Entity Framework Cache by adding Redis support. It will update the Entity Framework to 6.1.3 version due to dependencies. DEFINE CACHING POLICY A cache needs to know how to forget data and that's done through a caching policy. Let's set one for our Redis cache by first adding RedisCachingPolicy to the Entities folder. The class has to inherit from EFCache.CachingPolicy . using System; using System.Collections.ObjectModel; using System.Data.Entity.Core.Metadata.Edm; using EFCache; namespace RedisCacheForPostgre.Entities { public class RedisCachingPolicy : CachingPolicy { protected override void GetExpirationTimeout(ReadOnlyCollection affectedEntitySets, out TimeSpan slidingExpiration, out DateTimeOffset absoluteExpiration) { slidingExpiration = TimeSpan.FromMinutes(5); absoluteExpiration = DateTimeOffset.Now.AddMinutes(30); } } } There is GetExpirationTimeout method overridden – it configures: * absoluteExpiration = 30 minutes , means that every cache entry will expire after 30 minutes. * slidingExpiration = 5 minutes , means that a cache entry might be expired if it hasn’t been accessed in 5 minutes (sooner than the above). Of course, it’s useless at this point as the class is used nowhere. ENABLE ENTITY FRAMEWORK CACHE Let’s add the last class to the project a file called Configuration.cs . It should inherit from System.Data.Entity.DbConfiguration . using EFCache; using EFCache.Redis; using System.Configuration; using System.Data.Entity; using System.Data.Entity.Core.Common; namespace RedisCacheForPostgre.Entities { public class Configuration : DbConfiguration { public Configuration() { var redisConnection = ConfigurationManager.ConnectionStrings[""Redis""].ToString(); var cache = new RedisCache(redisConnection); var transactionHandler = new CacheTransactionHandler(cache); AddInterceptor(transactionHandler); Loaded += (sender, args) => { args.ReplaceService( (s, _) = } } } Entity Framework will search for a class that inherits DbConfiguration at runtime. This way the class becomes a code-based configuration for Entity Framework . There are a few things happening here. * Redis connection string is read from an application configuration * RedisCache object is created – it’s responsible for reading from and writing to the Redis database * CacheTransactionHandler is created and registered – it monitors database transactions * On Loaded event replaces the default provider with CachingProviderServices – it tries to get items from the Redis cache first and falls back to the standard provider. Note that we pass a new instance of RedisCachingPolicy – the class was defined in the previous point and is responsible for caching rules (e.g. when data should be forgotten). Finally, let’s try to run the application. It will print the same set of books. But having a look at Redis database, you can see that new entries appeared there. REDIS CACHE MODE Please also remember to set the Redis to cache mode, otherwise, it’ll keep expanding and scaling up. 
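The article does not spell out what “cache mode” involves. One common way to run Redis as a cache, assuming you control the Redis configuration (redis.conf or your provider’s settings panel), is to cap memory and enable key eviction, for example:

# redis.conf (illustrative values): bound the memory used by the cache
# and evict the least-recently-used keys once the limit is reached.
maxmemory 256mb
maxmemory-policy allkeys-lru

With an eviction policy like this in place, stale cache entries are discarded automatically instead of accumulating indefinitely.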
SUMMARY With the small sample dataset used here you won’t see a significant performance improvement: only a single, simple query is issued against a handful of rows, and PostgreSQL alone can serve that quickly. The point of this article was to show how easy it is to add caching to Entity Framework and how transparent it is. Once the cache was added to the framework we didn’t have to change anything in the PrintBooks method and it still worked. The same would apply to all Entity Framework queries (if we had more). -------------------------------------------------------------------------------- Do you want to shed light on a favorite feature in your preferred database? Why not write about it for Write Stuff? Image attribution: Patrick Tomasso. This article is licensed with CC-BY-NC-SA 4.0 by Compose.","Caching a database can be a chore, but Mariusz Bojkowski shows how easy it can be to add a Redis cache to your PostgreSQL database if you are using Entity Framework 6.",How to enable a Redis cache for PostgreSQL with Entity Framework 6,Live,235 684,"Varun Agrawal. Oct 19 IMPROVING REAL-TIME OBJECT DETECTION WITH YOLO A NEW PERSPECTIVE FOR REAL-TIME OBJECT DETECTION In recent years, the field of object detection has seen tremendous progress, aided by the advent of deep learning. Object detection is the task of identifying objects in an image and drawing bounding boxes around them, i.e. localizing them. It’s a very important problem in computer vision due to its numerous applications, from self-driving cars to security and tracking. Prior approaches to object detection have generally proposed pipelines made of separate stages run in sequence. This causes a disconnect between what each stage accomplishes and the final objective, which is drawing a tight bounding box around the objects in an image. An end-to-end framework that optimizes the detection error in a joint fashion would be a better solution, not just to train the model for better accuracy but also to improve detection speed. This is where the You Only Look Once (or YOLO) approach comes into play. Varun Agrawal told the Statsbot team why YOLO is the better option compared to other approaches in object detection. Illustration source. Deep learning has proven to be a powerful tool for image classification, achieving human level capability on this task.
Earlier detection approaches leveraged this power to transform the problem of object detection to one of classification, which is recognizing what category of objects the image belonged to. The way this was done was via a 2-stage process: 1. The first stage involved generating tens of thousands of proposals. They are nothing but specific rectangular areas on the image also known as bounding boxes, of what the system believed to be object-like things in the image. The bounding box proposal could either be around an actual object in an image or not, and filtering this out was the objective of the second stage. 2. In the second stage, an image classifier would classify the sub-image inside the bounding box proposal, and the classifier would say if it was of a particular object type or simply a non-object or background. While immensely accurate, this 2-step process suffered from certain flaws such as efficiency, due to the immense number of proposals being generated, and a lack of joint optimization over both proposal generation and classification. This leads to each stage not truly understanding the bigger picture, instead being siloed to their own mini-problem and thus limiting their performance. WHAT YOLO IS ALL ABOUT This is where YOLO comes in. YOLO, which stands for You Only Look Once, is a deep learning based object detection algorithm developed by Joseph Redmon and Ali Farhadi at the University of Washington in 2016. The rationale behind calling the system YOLO is that rather than pass in multiple subimages of potential objects, you only passed in the whole image to the deep learning system once. Then, you would get all the bounding boxes as well as the object category classifications in one go. This is the fundamental design decision of YOLO and is what makes it a refreshing new perspective on the task of object detection. The way YOLO works is that it subdivides the image into an NxN grid, or more specifically in the original paper a 7x7 grid. Each grid cell, also known as an anchor, represents a classifier which is responsible for generating K bounding boxes around potential objects whose ground truth center falls within that grid cell (K is 2 in the paper) and classifying it as the correct object. Note that the bounding box is not restricted to be within the grid cell, it can expand within the boundaries of the image to accommodate the object it believes it is responsible to detect. This means that in the current version of YOLO, the system generates 98 bounding boxes of varying sizes to accommodate the various objects in the scene.PERFORMANCE AND RESULTS For more dense object detection, a user could set K or N to a higher number based on their needs. However, with the current configuration, we have a system that is able to output a large number of bounding boxes around objects as well as classify them into one of various object categories, based on the spatial layout of the image. This is done in a single pass through the image at inference time. Thus, the joint detection and classification leads to better optimization of the learning objective (the loss function) as well as real-time performance. Indeed, the results of YOLO are very promising. On the challenging Pascal VOC detection challenge dataset , YOLO manages to achieve a mean average precision, or mAP, of 63.4 (out of 100) while running at 45 frames per second. In comparison, the state of the art model, Faster R-CNN VGG 16 achieves an mAP of 73.2, but only runs at a maximum 7 frames per second, a 6x decrease in efficiency. 
You can see comparisons of YOLO to other detection frameworks in the table below. If one lets YOLO sacrifice some more accuracy, it can run at 155 frames per second, though only at an mAP of 52.7. Thus, the main selling point for YOLO is its promise of good performance in object detection at real-time speeds. That allows its use in systems such as robots, self-driving cars, and drones, where being time critical is of the utmost importance. YOLOV2 FRAMEWORK Recently, the same group of researchers released the new YOLOv2 framework, which leverages recent results in deep learning network design to build a more efficient network, and uses the anchor-box idea from Faster-RCNN to ease the learning problem for the network. Illustration source. The result is a detection system which is even better, achieving state-of-the-art performance at 78.6 mAP on the Pascal VOC detection dataset, while other systems, such as the improved version of Faster-RCNN (Faster-RCNN ResNet) and SSD500, only achieve 76.4 mAP and 76.8 mAP on the same test dataset. The key differentiator, though, is speed. The best performing YOLOv2 model runs at 40 FPS compared to 5 FPS for Faster-RCNN ResNet. Although SSD500 runs at 45 FPS, a lower resolution version of YOLOv2 with mAP 76.8 (the same as SSD500) runs at 67 FPS, showing the high performance capabilities of YOLOv2 as a result of its design choices. FINAL THOUGHTS In conclusion, YOLO has demonstrated significant performance gains while running at real-time speeds, an important middle ground in the era of resource-hungry deep learning algorithms. As we march towards a more automation-ready future, systems like YOLO and SSD500 are poised to usher in large strides of progress and enable the big AI dream. IMPORTANT READING THROUGH THE ARTICLE * You Only Look Once: Unified, Real-Time Object Detection * The PASCAL Visual Objects Challenge: A Retrospective * SSD: Single Shot Multibox Detector","Why YOLO is the better option compared to other approaches in real-time object detection.",Improving Real-Time Object Detection with YOLO,Live,236 686,"Armand Ruiz, Lead Product Manager Data Science Experience. Jan 22 DEEP LEARNING WITH DATA SCIENCE EXPERIENCE Deep learning is a branch of Machine Learning that uses lots of data to teach computers how to do things only humans were capable of before.
A good example of Deep Learning is perception, recognizing what’s in an image, what people are saying when they are talking, helping robots explore the world and interact with it. Deep learning is emerging as a central tool to solve perception problems in recent years. It’s the state of the art having to do with computer vision and speech recognition. Increasingly people are finding that deep learning is a much better tool to solve problems. Many companies today have made deep learning a central part of their machine learning toolkit. For example Facebook, Google and Uber are all using deep learning in their products. We at IBM are collaborating with the leaders in the market to push the research forward and lead in that space. Deep learning shines wherever there is lots of data and complex problems to solve and many companies today are facing lots of complicated problems. Deep learning can be applied to many different fields. As deep neural networks become increasingly important to everything from self-driving cars to voice recognition, new libraries are making it much easier to use deep learning to solve real problems. Building a training a multi-layer convolutional neural network would have taken hundreds of lines of code just a few years ago. In this post we are going to have an overview of the most popular Open Source projects that are available in the IBM Data Science Experience. WHY DEEP LEARNING NOW? One of the fascinating things about neural networks is how long they have taken to be an over night success. The history goes back all the way to the 1950s. Deep learning has really only taken off in the last five years.The reason is the increased availability of label data along with the greatly increased computational throughput of modern processors. For a long time, we didn’t have the huge label data sets that we needed to make deep learning work. Those data sets only became widely available with the rise of the Internet, which made collecting and labeling huge datasets feasible. But even when we had big datasets, we often didn’t have enough computational power to make us of them and it is only been in the last five years that processors have gotten big enough and fast enough to train large scale neural networks. HOW TO GET STARTED WITH DEEP LEARNING IN PYTHON There is a fast growing community of researchers, engineers, and data scientists who share a common, very powerful set of tools and most of them are Open Source. One of the nice things about deep learning is that it’s really a family of techniques that adapts to all sorts of data and all sorts of problems, all using a common infrastructure and a common language to describe things. The best is start with very simple models and move later to very large ones. It is simple to get started with your own personal computer to do very elaborate tasks. In the IBM Data Science Experience you have everything you need for free to start experimenting with Deep Learning technologies. Find here a summary of the most popular Deep Learning Python libraries and tutorials: * Theano : It is a low-level library that specializes in efficient computation. You’ll only use this directly if you need fine-grain customization and flexibility. → Tutorial * Tensorflow : It is another low-level library that is less mature than Theano. However, it’s supported by Google and offers out-of-the-box distributed computing. → Tutorial * Keras : It is a heavyweight wrapper for both Theano and Tensorflow. 
It’s minimalistic, modular, and awesome for rapid experimentation. This is our favorite Python library for deep learning and the best place to start for beginners. → Tutorial * Lasagne : It is a lightweight wrapper for Theano. Use this if need the flexibility of Theano but don’t want to always write neural network layers from scratch. → Tutorial * MXNet - It is another high-level library similar to Keras. It offers bindings for multiple languages and support for distributed computing. → Tutorial Resources * Getting Started with MXNet * Python deep learning * Machine Learning Blocked Unblock Follow FollowingARMAND RUIZ Lead Product Manager Data Science Experience #IBM #BigData #Analytics #RStats #Cloud - Born in Barcelona Living in Chicago - All tweets and opinions are my own FollowIBM WATSON DATA PLATFORM Build smarter applications and quickly visualize, share, and gain insights * * * * Never miss a story from IBM Watson Data Platform , when you sign up for Medium. Learn more Never miss a story from IBM Watson Data Platform Get updates Get updates","Deep learning is a branch of Machine Learning that uses lots of data to teach computers how to do things only humans were capable of before. A good example of Deep Learning is perception, recognizing…",Deep Learning with Data Science Experience,Live,237 689,"Cloudant is a database service that provides high-availability JSON data access. Kiwi Wearables's platform enables motion recognition for physical devices and software applications. Learn how Kiwi uses Cloudant on its back-end to persist motion events and process JSON between Node.js, Twilio, and other Web services.",Andy Ellicott and John David Chibuk talk about an Internet of Things application to record data captured from wearable technology and recorded in Cloudant ,"Building IoT Apps on Cloudant, with Kiwi Wearables",Live,238 691,"GETTING STARTED WITH GRAPHFRAMES IN APACHE SPARK David Taieb / July 15, 2016INTRODUCTION TO SPARK AND GRAPHS GraphX is one of the 4 foundational components of Spark — along with SparkSQL, Spark Streaming and MLlib — that provides general purpose Graph APIs including graph-parallel computation: GraphX APIs are great but present a few limitations. First they only work with Scala, so if you want to use GraphX with Python in a Jupyter Notebook, then you are out of luck. The second limitation is that they only work at the RDD ( Resilient Distributed Dataset ) level, which means that they can’t benefit from the performance improvement provided by DataFrames and the Catalyst query optimizer. GraphFrames is an open source Spark Package that was created with goal of addressing these two issues: * Provides a set of Python APIs * Works with DataFrames In this post, we’ll show how to get started with GraphFrames from a Python Notebook. We’ll start by creating a graph composed of airports as the vertices and flight routes as the edges, using the data from the flight predict application . I’ll then show interesting ways of visualizing the data and apply various graph algorithms to extract insights from the data. INSTALLING GRAPHFRAMES As previously mentioned, GraphFrames will be part of the Spark 2.0 distribution, but it’s currently available as a preview Spark package compatible with Spark 1.6 and higher. 
There are multiple ways to install the package depending on how you are running Spark: * Spark-submit or Spark-shell: simply add --packages graphframes:graphframes:0.1.0-spark1.6 as a command-line argument * Local Jupyter Notebook: assuming that you have access to the configuration files, all you need is to add --packages graphframes:graphframes:0.1.0-spark1.6 to the kernel.json located in ~/.ipython/kernels//kernel.json . { ""display_name"": ""pySpark (Spark 1.6.0) with graphFrames"", ""language"": ""python"", ""argv"": [ ""/Users/dtaieb/anaconda/envs/py27/bin/python"", ""-m"", ""ipykernel"", ""-f"", ""{connection_file}"" ], ""env"": { ""SPARK_HOME"": ""/Users/dtaieb/cdsdev/spark-1.6.0"", ""PYTHONPATH"": ""/Users/dtaieb/cdsdev/spark-1.6.0/python/:/Users/dtaieb/cdsdev/spark-1.6.0/python/lib/py4j-0.9-src.zip"", ""PYTHONSTARTUP"": ""/Users/dtaieb/cdsdev/spark-1.6.0/python/pyspark/shell.py"", ""PYSPARK_SUBMIT_ARGS"": ""--packages graphframes:graphframes:0.1.0-spark1.6 --master local[10] pyspark-shell"", ""SPARK_DRIVER_MEMORY"":""10G"", ""SPARK_LOCAL_IP"":""127.0.0.1"" } } * IPython Notebook (hosted on IBM Bluemix Apache Spark™ service): When the notebook is hosted and you don’t have access to the configuration files, I wished there were a magic command that would add a Spark Package to the session. Unfortunately there is no such thing today, so I made one :boom:. I created a helper Python library called pixiedust that implements a workaround. Note: The following steps currently only work on an python Notebook hosted on IBM Bluemix Open your python Notebook and run the following code: 1. Cell1: install the pixiedust library. !pip install --user pixiedust Or if you want to upgrade the version already installed: !pip install --user --upgrade --no-deps pixiedust 2. Cell2: import the pixiedust packageManager module and install graphframes. from pixiedust.packageManager import PackageManager pkg=PackageManager() pkg.installPackage(""graphframes:graphframes:0"") pkg.printAllPackages() sqlContext=SQLContext(sc) If all goes well, you should see a message printed in red in the output asking you to restart the kernel. You can do so using the menu: Kernel/Restart . 3. Once the kernel has restarted, run Cell2 again. Even though the Graphframes jar file is now part of the classpath, you still need to run the command to add the GraphFrames python APIs to the SparkContext. 4. Cell3: verify that GraphFrames is correctly installed. #import the display module from pixiedust.display import * #import the Graphs example from graphframes.examples import Graphs #create the friends example graph g=Graphs(sqlContext).friends() #use the pixiedust display display(g) Results of the code above should look like this: Note: I’ll be using the pixiedust display() API call in this post without diving into the details of how it’s built, which I’ll cover in a future post. CREATE A GRAPH WITH AIRPORTS AS NODES AND FLIGHT ROUTES AS EDGES At a high level, GraphFrames is to GraphX what DataFrames is to RDDs. It is built on top of Spark SQL and provides a set of APIs that elegantly combine Graph Analytics and Graph Queries: Diving into technical details, you need two DataFrames to build a Graph: one DataFrame for vertices and a second DataFrame for edges. With graphFrames successfully installed, we are now ready to load the data from the flight predict application . 
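Before loading the real flight data, the following tiny, self-contained sketch (toy airports and routes, not the flight predict data) shows the shape GraphFrames expects: a vertices DataFrame with an id column, and an edges DataFrame whose src and dst columns reference those ids.

from graphframes import GraphFrame

# Toy vertices: every vertex must have an 'id' column.
v = sqlContext.createDataFrame(
    [('BOS', 'Boston'), ('ORD', 'Chicago'), ('SFO', 'San Francisco')],
    ['id', 'city'])

# Toy edges: every edge must have 'src' and 'dst' columns referencing vertex ids.
e = sqlContext.createDataFrame(
    [('BOS', 'ORD', 'UA'), ('ORD', 'SFO', 'UA'), ('BOS', 'SFO', 'B6')],
    ['src', 'dst', 'carrierFsCode'])

toy_graph = GraphFrame(v, e)
toy_graph.degrees.show()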
As a reminder, the data lives in two Cloudant databases: * flight-metadata : contains the airports info * flightpredict_training_set : contains the flight routes augmented with weather info The first step is to configure the Cloudant-spark connector and load the 2 datasets: #Configure connector sc.addPyFile(""https://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats/raw/master/flightPredict/training.py"") sc.addPyFile(""https://github.com/ibm-cds-labs/simple-data-pipe-connector-flightstats/raw/master/flightPredict/run.py"") import training import run sqlContext=SQLContext(sc) training.sqlContext = sqlContext training.cloudantHost='dtaieb.cloudant.com' training.cloudantUserName='weenesserliffircedinvers' training.cloudantPassword='72a5c4f939a9e2578698029d2bb041d775d088b5' #load the 2 datasets airports = training.loadDataSet(""flight-metadata"", ""airports"") print(""airports count: "" + str(airports.count())) flights = training.loadDataSet(""pycon_flightpredict_training_set"",""training"") print(""flights count: "" + str(flights.count())) Results: Successfully cached dataframe Successfully registered SQL table airports airports count: 17535 Successfully cached dataframe Successfully registered SQL table training flights count: 33336 In this step, we build the vertices and edges DataFrames for our graph. The vertices (airports) must all have at least one edge (flights). They also must have a column named “id” that uniquely identifies the vertex. To meet these two requirements, the cell below performs a join between airports and flights, and renames the column “fs” (airport code) to “id”. from pyspark.sql import functions as f from pyspark.sql.types import * rdd = flights.flatMap(lambda s: [s.arrivalAirportFsCode, s.departureAirportFsCode]).distinct()\ .map(lambda row:[row]) vertices = airports.join( sqlContext.createDataFrame(rdd, StructType([StructField(""fs"",StringType())])), ""fs"" ).dropDuplicates([""fs""]).withColumnRenamed(""fs"",""id"") print(vertices.count()) The edges dataframe is almost ready, but we need to make sure that it has the columns “src” and “dst” that respectively reference the “id” of the source and destination airport. We also drop a few unneeded columns: edges=flights.withColumnRenamed(""arrivalAirportFsCode"",""dst"")\ .withColumnRenamed(""departureAirportFsCode"",""src"")\ .drop(""departureWeather"").drop(""arrivalWeather"").drop(""pt_type"").drop(""_id"").drop(""_rev"") We can now build the graph and display it: from graphframes import GraphFrame g = GraphFrame(vertices, edges) display(g) When you initially run this cell, you’ll see a table. But because pixiedust introspects the dataset, it knows it contains latitude and longitude coordinates that can be displayed on a map. Click the map pin icon to see the graph of airports and flights overlaid on a map of the United States: Note: The visualization above is coming from a sample pixiedust plugin that visualizes all the flights for selected airports. It also provides menus to display the vertices and edges as tables. LET’S DO SOME GRAPH COMPUTING! COMPUTE THE DEGREE FOR EACH VERTEX IN THE GRAPH The degree of a vertex is the number of edges incident to the vertex. In a directed graph, in-degree is the number of edges where vertex is the destination and out-degree is the number of edges where the vertex is the source. GraphFrames has properties for degrees , outDegrees and inDegrees . They return a DataFrame containing the id of the vertex and the number of edges. 
We then sort them in descending order: from pyspark.sql.functions import * degrees = g.degrees.sort(desc(""degree"")) display( degrees ) Results: COMPUTE A LIST OF SHORTEST PATHS FOR EACH VERTEX TO A SPECIFIED LIST OF LANDMARKS For this example we use the shortestPaths api that returns a DataFrame containing the properties for each vertex plus an extra column called distances that contains the number of hops to each landmark. In the following code, we use BOS and LAX as the landmarks: r = g.shortestPaths(landmarks=[""BOS"", ""LAX""]).select(""id"", ""distances"") display(r) Results: COMPUTE THE PAGERANK FOR EACH VERTEX IN THE GRAPH PageRank is a famous algorithm used by Google Search to rank vertices in a graph by order of importance. To compute pageRank, we’ll use the pageRank() API call that returns a new graph in which the vertices have a new pagerank column representing the pagerank score for the vertex, and the edges have a new weight column representing the edge weight that contributed to the pageRank score. We’ll then display the vertex ids and associated pageranks sorted in descending order: from pyspark.sql.functions import * ranks = g.pageRank(resetProbability=0.20, maxIter=5) display(ranks.vertices.select(""id"",""pagerank"").orderBy(desc(""pagerank""))) Results: SEARCH ROUTES BETWEEN TWO AIRPORTS WITH SPECIFIC CRITERIA In this section, we want to find all the routes between Boston and San Francisco operated by United Airlines with at most two hops. To perform this search, we use the bfs() ( breadth-first search ) API call that returns a DataFrame containing the shortest path between matching vertices. For clarity, we will only keep the edge when displaying the results: paths = g.bfs(fromExpr=""id='BOS'"",toExpr=""id = 'SFO'"",edgeFilter=""carrierFsCode='UA'"", maxPathLength = 2).drop(""from"").drop(""to"") display(paths) Results: FIND ALL AIRPORTS THAT DO NOT HAVE DIRECT FLIGHTS BETWEEN EACH OTHER In this section, we’ll use a very powerful graphFrames search feature that uses a pattern called motif to find nodes. We’ll use it to apply the pattern ""(a)-[]-(b)-[]-!(a)-[]->(c)"" , which searches for all nodes a, b and c that have a path to (a,b) and a path to (b,c) but not a path to (a,c). Also, because the search is computationally expensive, we reduce the number of edges by grouping the flights that have the same src and dst. from pyspark.sql import functions as F h = GraphFrame(g.vertices, g.edges.select(""src"",""dst"").groupBy(""src"",""dst"").agg(F.count(""src"").alias(""count""))) query = h.find(""(a)-[]-(b)-[]-!(a)-[]-(c)"").drop(""b"") display(query) Results: COMPUTE THE STRONGLY CONNECTED COMPONENTS FOR THIS GRAPH Strongly Connected Components are components for which each vertex is reachable from every other vertex. To compute them, we’ll use the stronglyConnectedComponents() API call that returns a DataFrame containing all the vertices, with the addition of a component column that contains the id value of each connected vertex. We then group all the rows by components and aggregate the sum of all the member vertices. This gives us a good idea of the components distribution in the graph. from pyspark.sql.functions import * components = g.stronglyConnectedComponents(maxIter=10).select(""id"",""component"")\ .groupBy(""component"").agg(F.count(""id"").alias(""count"")).orderBy(desc(""count"")) display(components) Results: DETECT COMMUNITIES IN THE GRAPH USING LABEL PROPAGATION ALGORITHM Label propagation is a popular algorithm for finding communities within a graph. 
It has the advantage of being computationally inexpensive and thus works well with large graphs. To compute the communities, we’ll use the labelPropagation() API call that returns a DataFrame containing all the vertices, with the addition of a label column that contains the id value of each connected vertex. Similar to the strongly connected components computation, we’ll then group all the rows by label and aggregate the sum of all the member vertices. from pyspark.sql.functions import * communities = g.labelPropagation(maxIter=5).select(""id"", ""label"")\ .groupBy(""label"").agg(F.count(""id"").alias(""count"")).orderBy(desc(""count"")) display(communities) Results: CONCLUSION In this post, we have learned several things: * How to use GraphFrames (and any other Spark packages) within an IPython notebook, including for the IBM Analytics for Apache Spark service on Bluemix. * We’ve introduced the pixiedust module that, among other things, provides a simple API to create compelling in-context interactive visualizations. * We’ve shown how to create a graph from data stored in the Cloudant JSON database service. * Finally, we’ve explored a few of the graph computation APIs provided by GraphFrames. Of course there is much more to explore, but hopefully this post gave you ideas you can reuse. All the exercises and code are conveniently available in a completed Jupyter Notebook . Feel free to import it into your own Spark environment or on the IBM Apache Spark service — and use it as a starting point in your own project. SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Click to share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Tagged: Apache Spark / GraphFrames / GraphX / IPython / Jupyter / Notebooks Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Geospatial * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM Privacy IBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.","We show how to build a graph of airports and flight paths using GraphFrames. Then, visualize the data and apply various graph algorithms to analyze it.",Getting started with GraphFrames in Apache Spark™,Live,239 695,"RStudio Blog * Home * Subscribe to feed SPARK 1.4 FOR RSTUDIO July 14, 2015 in RStudio IDE | Tags: Spark , SparkR Today’s guest post is written by Vincent Warmerdam of GoDataDriven and is reposted with Vincent’s permission from blog.godatadriven.com . You can learn more about how to use SparkR with RStudio at the 2015 EARL Conference in Boston November 2-4, where Vincent will be speaking live. This document contains a tutorial on how to provision a spark cluster with RStudio. You will need a machine that can run bash scripts and a functioning account on AWS. Note that this tutorial is meant for Spark 1.4.0. 
Future versions will most likely be provisioned in another way but this should be good enough to help you get started. At the end of this tutorial you will have a fully provisioned spark cluster that allows you to handle simple dataframe operations on gigabytes of data within RStudio. AWS PREP Make sure you have an AWS account with billing. Next make sure that you have downloaded your .pem files and that you have your keys ready. SPARK STARTUP Next go and get spark locally on your machine from the spark homepage . It’s a pretty big blob. Unzip it once it is downloaded go to the ec2 folder in the spark folder. Run the following command from the command line. ./spark-ec2 \ --key-pair=spark-df \ --identity-file=/Users/code/Downloads/spark-df.pem \ --region=eu-west-1 \ -s 1 \ --instance-type c3.2xlarge \ launch mysparkr This script will use your keys to connect to amazon and setup a spark standalone cluster for you. You can specify what type of machines you want to use as well as how many and where on amazon. You will only need to wait until everything is installed, which can take up to 10 minutes. More info can be found here . When the command signals that it is done, you can ssh into your machine via the command line. ./spark-ec2 -k spark-df -i /Users/code/Downloads/spark-df.pem --region=eu-west-1 login mysparkr Once you are in your amazon machine you can immediately run SparkR from the terminal. chmod u+w /root/spark/ ./spark/bin/sparkR As just a toy example, you should be able to confirm that the following code already works. ddf <- createDataFrame(sqlContext, faithful) head(ddf) printSchema(ddf) This ddf dataframe is no ordinary dataframe object. It is a distributed dataframe, one that can be distributed across a network of workers such that we could query it for parallelized commands through spark. SPARK UI This R command you have just run launches a spark job. Spark has a webui so you can keep track of the cluster. To visit the web-ui, first confirm on what IP-address the master node is via this command: curl icanhazip.com You can now visit the webui via your browser. :4040 From here you can view anything you may want to know about your spark clusters (like executor status, job process and even a DAG visualisation). This is a good moment to stand still and realize that this on it’s own right is already very cool. We can start up a spark cluster in 15 minutes and use R to control it. We can specify how many servers we need by only changing a number on the command line and without any real developer effort we gain access to all this parallelizing power. Still, working from a terminal might not be too productive. We’d prefer to work with a GUI and we would like some basic plotting functionality when working with data. So let’s install RStudio and get some tools connected. RSTUDIO SETUP Get out of the SparkR shell by entering q() . Next, download and install Rstudio. wget http://download2.rstudio.org/rstudio-server-rhel-0.99.446-x86_64.rpm sudo yum install --nogpgcheck -y rstudio-server-rhel-0.99.446-x86_64.rpm rstudio-server restart While this is installing. Make sure the TCP connection on the 8787 port is open in the AWS security group setting for the master node. A recommended setting is to only allow access from your ip. Then, add a user that can access RStudio. We make sure that this user can also access all the RStudio files. adduser analyst passwd analyst You also need to do this (the details of why are a bit involved). 
These edits need to be made because the analyst user doesn’t have root permissions. chmod a+w /mnt/spark chmod a+w /mnt2/spark sed -e 's/^ulimit/#ulimit/g' /root/spark/conf/spark-env.sh > /root/spark/conf/spark-env2.sh mv /root/spark/conf/spark-env2.sh /root/spark/conf/spark-env.sh ulimit -n 1000000 When this is known, point the browser to :8787 . Then login in as analyst. RSTUDIO – SPARK LINK Awesome. RStudio is set up. First start up the master submit. /root/spark/sbin/stop-all.sh /root/spark/sbin/start-all.sh This will reboot Spark (both the master and slave nodes). You can confirm that spark works after this command by pointing the browser to :8080 . Next, let’s go and start Spark from RStudio. Start a new R script, and run the following code: print('Now connecting to Spark for you.') spark_link <- system('cat /root/spark-ec2/cluster-url', intern=TRUE) .libPaths(c(.libPaths(), '/root/spark/R/lib')) Sys.setenv(SPARK_HOME = '/root/spark') Sys.setenv(PATH = paste(Sys.getenv(c('PATH')), '/root/spark/bin', sep=':')) library(SparkR) sc <- sparkR.init(spark_link) sqlContext <- sparkRSQL.init(sc) print('Spark Context available as \""sc\"". \\n') print('Spark SQL Context available as \""sqlContext\"". \\n') LOADING DATA FROM S3 Let’s confirm that we can now play with the RStudio stack by downloading some libraries and having it run against a data that lives on S3. small_file = ""s3n://:@/data.json"" dist_df <- read.df(sqlContext, small_file, ""json"") %>% cache This dist_df is now a distributed dataframe, which has a different api than the normal R dataframe but is similar to dplyr . head(summarize(groupBy(dist_df, df$type), count = n(df$auc))) Also, we can install magrittr to make our code look a lot nicer. local_df <- dist_df %>% groupBy(df$type) %>% summarize(count = n(df$id)) %>% collect The collect method pulls the distributed dataframe back into a normal dataframe on a single machine so you can use plotting methods on it again and use R as you would normally. A common use case would be to use spark to sample or aggregate a large dataset which can then be further explored in R. Again, if you want to view the spark ui for these jobs you can just go to: :4040 A MORE COMPLETE STACK Unfortunately this stack has an old version of R (we need version 3.2 to get the newest version of ggplot2/dplyr). Also, as of right now there isn’t support for the machine learning libraries yet. These are known issues at the moment and version 1.5 should show some fixes. Version 1.5 will also feature RStudio installation as part of the ec2 stack. Another issue is that the namespace of dplyr currently conflicts with sparkr , time will tell how this gets resolved. Same would go for other data features like windowing function and more elaborate data types. KILLING THE CLUSTER When you are done with the cluster, you only need to exit the ssh connection and run the following command: ./spark-ec2 -k spark-df -i /Users/code/Downloads/spark-df.pem --region=eu-west-1 destroy mysparkr CONCLUSION The economics of spark are very interesting. We only pay amazon for the time that we are using Spark as a compute engine. All other times we’d only pay for S3. This means that if we analyse for 8 hours, we’d only pay for 8 hours. Spark is also very flexible in that it allows us to continue coding in R (or python or scala) without having to learn multiple domain specific languages or frameworks like in hadoop. Spark makes big data really simple again. 
This document is meant to help you get started with Spark and RStudio, but in a production environment there are a few things you still need to account for:

* security: the web connection is not served over https. Even though we tell Amazon to accept connections only from our IP, we may still be at risk if there is a man in the middle listening.
* multiple users: this setup works fine for a single user, but if several people share the cluster you will need to rethink some steps with regards to user groups, file access and resource management.
* privacy: this setup works well on EC2, but if you have sensitive, private user data you may need to do this on premise because the data cannot leave your own datacenter. Most install steps would be the same, but the initial installation of Spark would require the most work. See the docs for more information.

Spark is an amazing tool; expect more features in the future.

POSSIBLE GOTCHA

Hanging: it can happen that the ec2 script hangs at the Waiting for cluster to enter 'ssh-ready' state step. This can happen if you use Amazon a lot. To prevent it, you may want to remove some lines in ~/.ssh/known_hosts . More info here . Another option is to add the following lines to your ~/.ssh/config file.

# AWS EC2 public hostnames (changing IPs)
Host *.compute.amazonaws.com
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null

3 COMMENTS

July 20, 2015 at 8:43 pm, ypouliot: Thanks Garrett, very helpful. I did run into one problem: the SparkR shell is not present. I.e., ./spark/bin/sparkR returns “no such file”. I couldn’t find it anywhere. Could you advise, please?

July 21, 2015 at 8:43 am, Vincent D. Warmerdam (@fishnets88): This file should be present on the Amazon server, not in the GitHub project. Just to double check: you were able to log in to the master node, and that server didn’t have ./spark/bin/sparkR ?

July 21, 2015 at 1:09 pm, ypouliot: That’s right, it wasn’t present on the master node.
Join 19,578 other followers Build a website with WordPress.com Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email. %d bloggers like this:",Today’s guest post is written by Vincent Warmerdam of GoDataDriven and is reposted with Vincent’s permission from blog.godatadriven.com. You can learn more about how to use SparkR with …,Spark 1.4 for RStudio,Live,240 698,"Homepage IBM Watson Data Lab Follow Sign in / Sign up * Home * Cognitive Computing * Data Science * Web Dev * Brad Noble Blocked Unblock Follow Following Developer Advocacy at IBM. Formerly, product design at Cloudant (@ibmcloudant), founder at PostPost (RIP), and lunk at various agencies. 2 days ago -------------------------------------------------------------------------------- I AM NOT A DATA SCIENTIST BUT I PLAY ONE IN THIS BLOG POST, THANKS TO PIXIEDUST At a recent All Hands, I shared some thoughts about platforms and notebooks. If you weren’t there, you didn’t miss much. The only takeaway — and takeaway is probably generous — was this Venn diagram: Readers may notice that there’s an idea lurking in the footnote at the bottom of this diagram. The idea is that notebooks, considered by most to be the domain of the data scientist, have a real shot at helping teams of all types who are working on data problems. I’m happy with the colors, but to bring this idea to life, we’ll need more than a Venn diagram, amirite? Enter, PixieDust. NOTEBOOKS FOR EVERYONE PixieDust is a helper library for Python notebooks. It makes working with data simpler. With PixieDust, I can do this in a notebook… # load a CSV with pixiedust.sampledata() df = pixiedust.sampleData(""https://github.com/ibm-cds-labs/open-data/raw/master/cars/cars.csv"") # display the data with pixiedust display(df) Instead of doing all this… from pyspark.sql.types import DecimalType import matplotlib.pyplot as plt from matplotlib import cm import math #Load the csv, this assumes that the file is already downloaded on a local file system path=""/path/to/my/csv"" df3 = sqlContext.read.format('com.databricks.spark.csv')\ .options(header='true', mode=""DROPMALFORMED"", inferschema='true').load(path) maxRows = 100 def toPandas(workingDF): decimals = [] for f in workingDF.schema.fields: if f.dataType.__class__ == DecimalType: decimals.append(f.name) pdf = workingDF.toPandas() for y in pdf.columns: if pdf[y].dtype.name == ""object"" and y in decimals: #spark converts Decimal type to object during toPandas, cast it as float pdf[y] = pdf[y].astype(float) return pdf xFields = [""horsepower""] yFields = [""mpg""] workingDF = df3.select(xFields + yFields) workingDF = workingDF.dropna() count = workingDF.count() if count > maxRows: workingDF = workingDF.sample(False, (float(maxRows) / float(count))) pdf = toPandas(workingDF) #sort by xFields pdf.sort_values(xFields, inplace=True) fig, ax = plt.subplots(figsize=( int(1000/ 96), int(750 / 96) )) for i,keyField in enumerate(xFields): pdf.plot(kind='scatter', x=keyField, y=yFields[0], label=keyField, ax=ax, color=cm.jet(1.*i/len(xFields))) #Conf the legend if ax.get_legend() is not None and ax.title is None or not ax.title.get_visible() or ax.title.get_text() == '': numLabels = len(ax.get_legend_handles_labels()[1]) nCol = int(min(max(math.sqrt( numLabels ), 3), 6)) nRows = int(numLabels/nCol) bboxPos = max(1.15, 1.0 + ((float(nRows)/2)/10.0)) ax.legend(loc='upper center', bbox_to_anchor=(0.5, bboxPos),ncol=nCol, 
fancybox=True, shadow=True) #conf the xticks labels = [s.get_text() for s in ax.get_xticklabels()] totalWidth = sum(len(s) for s in labels) * 5 if totalWidth > 1000: #filter down the list to max 20 xl = [(i,a) for i,a in enumerate(labels) if i % int(len(labels)/20) == 0] ax.set_xticks([x[0] for x in xl]) ax.set_xticklabels([x[1] for x in xl]) plt.xticks(rotation=30) plt.show() To get this… A scatterplot! No code! With options and controls I can use! That’s data I can explore! STEPPING THROUGH THE BENEFITS With PixieDust, I can[1] 1. Visualize my data , without having to RTFM and trial-and-error Matplotlib (or other renderers) 2. Explore my data in an embedded interface, and switch between renderers (e.g., Matplotlib, Bokeh, Seaborn) 3. Use Spark , without having to RTFM Spark 4. Do those things , all of which I hadn’t done before — not even once — and then share those things with people, which I’m doing now! With PixieDust, data scientists and data engineers can * Use Python and Scala in the same notebook * Share variables between Scala and Python * Access Spark libraries written in Scala from Python notebooks * Access Python visualizations from Scala notebooks * Use any other tools they like, e.g., hard-coded Matplotlib, Bokeh, etc. Now, people with varied skills and skill levels — even people like me — can use and share notebooks, and collaborate. But don’t just take my word for it. Ben Hudson , an offering manager on the dashDB team, said this about PixieDust: I wanted an easy way to map out some geographical data I added to the dataset, but all the Python tutorials I had come across were too complex for my needs, so PixieDust was perfect for me. Instead of having to import a ton of packages and try to reverse-engineer code from an online tutorial, I only had to do a few clicks to generate a really nice map using PixieDust. PixieDust also made general graphing tasks a lot easier (no need for matplotlib) and it was really straightforward to use in general.(Ben’s even started logging PixieDust issues on Github . Thanks, Ben!) USE PIXIEDUST You have a couple options. IBM Data Science Experience (DSX): Check out the PixieDust intro notebook on DSX to see PixieDust in action. To play with this notebook in DSX, follow these steps to bring the notebook into your account: 1. Click Add Notebooks 2. Click From URL 3. Enter Notebook Name 4. Enter Notebook URL : https://github.com/ibm-cds-labs/pixiedust/raw/master/notebook/DSX/Welcome%20to%20PixieDust.ipynb 5. Select the Spark Service 6. Click Create Notebook Jupyter Notebooks : If you’re comfortable on the command line, you can run PixieDust inside Jupyter Notebooks on your laptop, too. The PixieDust installation guide has you covered for an easy install, and takes care of configuration and all the dependencies at once (e.g., installs Spark, Scala, the Cloudant-Spark connector, and a few sample notebooks). FURTHER READING * PixieDust on Github * PixieDust documentation * Announcing PixieDust 1.0 , by David Taieb FOOTNOTES [1] I am not a data scientist. Not even close. * Data Science * Python * Data Engineering * Scala * Apache Spark 17 Blocked Unblock Follow FollowingBRAD NOBLE Developer Advocacy at IBM. Formerly, product design at Cloudant ( @ibmcloudant ), founder at PostPost (RIP), and lunk at various agencies. FollowIBM WATSON DATA LAB The things you can make with data, on the IBM Watson Data Platform. * Share * 17 * * * Never miss a story from IBM Watson Data Lab , when you sign up for Medium. 
Learn more Never miss a story from IBM Watson Data Lab Get updates Get updates",PixieDust is a helper library for notebooks that makes it easier for teams of all types to work with data.,I Am Not a Data Scientist – IBM Watson Data Lab,Live,241 703,"Compose Databases * MongoDB * Elasticsearch * RethinkDB * Redis * PostgreSQL * etcd * RabbitMQ * ScyllaDB * MySQL Enterprise Pricing Articles Sign in Free 30-Day TrialOMNI LABS – MAKING THE MOST OF COMPOSE Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published Dec 8, 2016Learn how startup Omni Labs uses Compose-hosted MongoDB and a combination of Node.js, React, and Spark Python to help bootstrap their startup. We had the pleasure of meeting Vikram Tiwari, a full-stack developer at Omni Labs, at DataLayer 2016 in September. Tiwari presented on the topic of working with Compose to bootstrap your startup , based on his experience at Omni Labs, a bootstrapped startup in San Francisco that seeks to make it easier for marketers to work with data. We spoke with Tiwari and Alex Modon, CEO and co-founder, to learn more about their experience with MongoDB hosted on Compose. Omni is an ""automated visualization platform for marketers to see all of their data in one place, without having to manually do anything,"" explained Modon. While there are many BI tools available for companies to use, they still require quite technical specialization and time to manage. ""With our platform, there's no pixel placement or database integrations. Users just sign-in to see all of their up-to-date marketing KPI's in a custom dashboard."" The company's name comes from the pursuit of omnichannel marketing - gathering and analyzing data across multiple platforms to construct the most effective cross-media campaigns. Omni enables their customers to stream raw reports from their current media partners via API integrations while transforming that data into constantly updated KPI's. Omni is built on a Node.js backend with a React front-end and using Spark Python for data processing, with MongoDB and other databases underneath. ""Our Mongo-powered app serves as a center point for our customers,"" explained Vikram. ""Customers can do multiple queries on the data set, see past performance reports, or even set up alerts that get posted into a Slack channel."" All the server stacks are built around Node.js, and all the data that is collected goes through ETL pipelines built on Python and Google Cloud and processed by Spark. From there, the data is stored in Google's data warehouse, Big Query. ""The data is processed back to the client and we push some part of that data into MongoDB and some of it into Redis, based on how real-time the needs are."" While Omni is still in the early startup phase, much of their focus is on building predictive analytics for customers. They use Tensorflow for much of the machine learning process. ""Machines are really good at making decisions, as long as you feed them the right amount of data and tell them what success is. We're working really hard on rolling out products that are more predictive and help analyze opportunities to optimize campaigns, generate new media plans, and basically take care of vendor management."" Because the platform was built on various JavaScript tools (Node, JQuery, and React) MongoDB was an easy choice for Omni. The open-source MongoDB community, a plethora of answers available on StackOverflow, and a mature set of libraries also nudged them towards MongoDB. 
As for why Compose, Modon added, ""With any startup, the most valued resource is time. Compose removes the 'white knuckle' approach to database management. There's only so many hours in the day, so it's great knowing that our database is being taken care of by a company with a high level of quality and dedication.""

Jon Silvers works in marketing at Compose.",Customer use-case.,Making the Most of Compose – Customer: Omni Labs,Live,242 704,"MONGO METRICS: CALCULATING THE MODE

Published Apr 12, 2017

In this third entry in our Mongo Metrics series, we'll round out the ""top 3"" classical analytics methods by taking a look at the mode. Check out our previous articles in this series to learn more about computing and using the mean and median in MongoDB. We've seen how the mean and median each provide a different perspective on what a ""typical"" order looks like, and why it's important to view your data from several angles to understand it fully. Now, let's look at one more angle: the mode.

WHAT'S IN A MODE?

The mode is one of the simplest of the classic methods to understand: it is the most common item, the one occurring most frequently, in a set of data. Unlike the mean or median, the mode does not always yield a useful result. For example, if every value in our dataset occurs exactly once, there is no meaningful mode. Let's take a look at a data set where the mode has real value: determining which products or price points are popular. Mode is great for this because stores will often price many items at the same price points. By analyzing how well products sell at various price points, stores can determine more efficient pricing and improve their overall sales.
For this example, we'll borrow the pet store product catalog from our Metrics Maven's article on mode in PostgreSQL : order_id | date | item_count | order_value ------------------------------------------------ 50000 | 2016-09-02 | 3 | 35.97 50001 | 2016-09-02 | 2 | 7.98 50002 | 2016-09-02 | 1 | 5.99 50003 | 2016-09-02 | 1 | 4.99 50004 | 2016-09-02 | 7 | 78.93 50005 | 2016-09-02 | 0 | (NULL) 50006 | 2016-09-02 | 1 | 5.99 50007 | 2016-09-02 | 2 | 19.98 50008 | 2016-09-02 | 1 | 5.99 50009 | 2016-09-02 | 2 | 12.98 50010 | 2016-09-02 | 1 | 20.99 Which, stored in JSON format in MongoDB, looks like the following: { ""_id"" : ObjectId(""58db58313b9bbe23a46e91af""), ""order_id"" : 50005, ""date"" : ISODate(""2016-09-02T00:00:00Z""), ""item_count"" : 0 } { ""_id"" : ObjectId(""58db58873b9bbe21cb6e91b1""), ""order_id"" : 50002, ""date"" : ISODate(""2016-09-02T00:00:00Z""), ""item_count"" : 1, ""order_value"" : 5.99 } { ""_id"" : ObjectId(""58db58d33b9bbe1f886e91b0""), ""order_id"" : 50003, ""date"" : ISODate(""2016-09-02T00:00:00Z""), ""item_count"" : 1, ""order_value"" : 4.99 } { ""_id"" : ObjectId(""58db58fc3b9bbe21cb6e91b2""), ""order_id"" : 50010, ""date"" : ISODate(""2016-09-02T00:00:00Z""), ""item_count"" : 1, ""order_value"" : 20.99 } { ""_id"" : ObjectId(""58db591d3b9bbe21cb6e91b3""), ""order_id"" : 50006, ""date"" : ISODate(""2016-09-02T00:00:00Z""), ""item_count"" : 1, ""order_value"" : 5.99 } { ""_id"" : ObjectId(""58db59403b9bbe21cb6e91b4""), ""order_id"" : 50008, ""date"" : ISODate(""2016-09-02T00:00:00Z""), ""item_count"" : 1, ""order_value"" : 5.99 } { ""_id"" : ObjectId(""58db596a3b9bbe21cb6e91b5""), ""order_id"" : 50009, ""date"" : ISODate(""2016-09-02T00:00:00Z""), ""item_count"" : 2, ""order_value"" : 12.98 } { ""_id"" : ObjectId(""58db598c3b9bbe240a6e91ae""), ""order_id"" : 50007, ""date"" : ISODate(""2016-09-02T00:00:00Z""), ""item_count"" : 2, ""order_value"" : 19.98 } { ""_id"" : ObjectId(""58db59ac3b9bbec65f6e91c0""), ""order_id"" : 50000, ""date"" : ISODate(""2016-09-02T00:00:00Z""), ""item_count"" : 3, ""order_value"" : 35.97 } { ""_id"" : ObjectId(""58db59d43b9bbe21cb6e91b6""), ""order_id"" : 50004, ""date"" : ISODate(""2016-09-02T00:00:00Z""), ""item_count"" : 7, ""order_value"" : 78.93 } Unlike PostgreSQL, MongoDB doesn't have a MODE keyword so we'll have to compute it ourselves. Luckily, the MongoDB aggregations pipeline comes to the rescue yet again. Let's take a look at how we can use it to compute the mode of our data set. GETTING IN THE MODE Before we get started, make sure you have a foundational understanding of the $match and $group operators in the MongoDB aggregation pipeline. If you need some background, you can check out our previous article on MongoDB aggregations by example . For our first step, we need to figure out which fields we want to calculate the mode on. Let's start by getting the mode of the order_value field so we can get a better picture of what a typical order value might be. Mode is calculated by grouping the data in the data set together based on order_value , counting the number of items in each group, and finding the group with the highest count. We can do that using the $group and $sum aggregation operators and filtering out any NULL or invalid fields by first running it through a $match operation. Then, we'll sort the results in descending order using the $sort aggregation operator. Finally, we'll return only the first document in the sort by using the $limit operator. 
Our aggregation starts with the $match operator to filter out our NULL values: { $match: { order_value: { $exists: true } } } Next, let's run our $group query to group all of our order values into distinct groups. We'll also need to count the number of times an order_value occurs so we can sort it later. We can do that all in one shot with the following query: { $group: { _id: ""$order_value"", count: { $sum: 1 } } } Once this stage of the pipeline is reached, our data should now look like the following: { ""_id"" : 78.93, ""count"" : 1 } { ""_id"" : 35.97, ""count"" : 1 } { ""_id"" : 19.98, ""count"" : 1 } { ""_id"" : 20.99, ""count"" : 1 } { ""_id"" : 12.98, ""count"" : 1 } { ""_id"" : 4.99, ""count"" : 1 } { ""_id"" : 5.99, ""count"" : 3 } The last step now is to find the order_value s with the maximum count. There are a few ways we can do this, but one of the simplest is to sort the data by the count field and then just return the top result. First, let's sort the data using the $sort aggregation. We'll sort on the count field, and sort in descending order: { $sort: { ""count"": -1 } } This should give us the following result: { ""_id"" : 5.99, ""count"" : 3 } { ""_id"" : 78.93, ""count"" : 1 } { ""_id"" : 35.97, ""count"" : 1 } { ""_id"" : 19.98, ""count"" : 1 } { ""_id"" : 20.99, ""count"" : 1 } { ""_id"" : 12.98, ""count"" : 1 } { ""_id"" : 4.99, ""count"" : 1 } Finally, we'll use the $limit aggregation to simply limit the return values to only the first one: { $limit: 1 } This should return only the first document that we matched: { ""_id"" : 5.99, ""count"" : 3 } And there's our mode - our most common order is one with a value of $5.99, and it was encountered 3 times. Our completed query looks like the following: > db.transactions.aggregate([ { $match: { order_value: { $exists: true } } }, { $group: { _id: ""$order_value"", count: { $sum: 1 } } }, { $sort: { ""count"": -1} } , { $limit: 1 } ]) We can also calculate the mode for the number of items purchase in a transaction by performing the same calculation on the item_count field: > db.transactions.aggregate([ { $match: { item_count: { $exists: true } } }, { $group: { _id: ""$item_count"", count: { $sum: 1 } } }, { $sort: { ""count"": -1} } , { $limit: 1 } ]) Which gives us the following: { ""_id"" : 1, ""count"" : 5 } This means that the most common number of items in a transaction is 1, and it occured in 5 transactions. WHY SHOULD I CARE? That's a great question - with all of those wonderful metrics out there, why should you care about the mode ? Like always, it comes down to giving you a different perspective on your data. You can read an excellent writeup about the differences in the Metrics Maven article on Mode , and the following table from that article is perhaps the most insightful: Mean item count = 2.10 Median item count = 1.5 Mode item count = 1 Mean order value = $19.98 Median order value = $10.48 Mode order value = $5.99 When you have data that's likely to repeat itself (ie: repeated transactions), the mode can show you details that mean and median don't. If we expected the mean or even the median to help us determine what to expect from a typical order, we might be very surprised when our projections were substantially off. Our median order value of $10.48 is almost double what our most frequent order price actually is. Mean here is almost completely useless as it is heavily skewed by a few outliers. 
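One caveat with the $limit: 1 approach above: when two or more values tie for the highest count, it silently returns just one of them. The sketch below is an addition to the article's example rather than part of it; it reuses the same collection and field names and shows one way to return every tied value, using only aggregation operators available in MongoDB 3.2 and later. Treat it as a starting point rather than the canonical solution.

// Hypothetical extension of the pipeline above: return *all* modes when
// several order values share the top count.
db.transactions.aggregate([
  { $match: { order_value: { $exists: true } } },
  // Count how often each order value occurs
  { $group: { _id: ""$order_value"", count: { $sum: 1 } } },
  // Most frequent values first
  { $sort: { count: -1 } },
  // Collect every value, remembering the highest count seen
  { $group: {
      _id: null,
      maxCount: { $first: ""$count"" },
      values: { $push: { value: ""$_id"", count: ""$count"" } }
  } },
  // Keep only the values whose count equals the maximum
  { $project: {
      _id: 0,
      modes: {
        $filter: {
          input: ""$values"",
          as: ""v"",
          cond: { $eq: [""$$v.count"", ""$maxCount""] }
        }
      }
  } }
])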
WRAPPING IT UP

While these are certainly not the only ways to compute metrics across our MongoDB data sets, the three ""classic"" statistical methods are a great starting point for analyzing your data. We'll continue this series at a later date by exploring more ways to analyze and gain insight from our MongoDB data.","In this third entry in our Mongo Metrics series, we'll round out the ""top 3"" classical analytics methods by taking a look at mode.",Mongo Metrics: Calculating the Mode,Live,243 705,"","Data exploration and analysis is a repetitive, iterative process, but in order to meet business demands, data scientists do not always have the luxury of long development cycles. What if data scientists could answer bigger and tougher questions faster? What if they could more easily and rapidly experiment, test hypotheses and work more collaboratively on interactive analytics?",Notebooks: A power tool for data scientists,Live,244 706,"SEIUM CONFERENCE AND HEARTBITS HACKATHON

Mike Elsmore / March 21, 2016

HOW WAS SEIUM?

SEIUM is a week-long polyglot conference held at the University of Minho in Braga, Portugal. I was invited to give a talk titled No Service, which centered on the technology ecosystem for offline Web development. My talk followed a morning workshop on building Android applications, which covered using local storage engines within apps.
I had a natural audience for my presentation: people already considering how to make applications work independent of servers.

SEIUM's excellent site design at seium.org

MY TALK

I covered three main subjects for offline-first HTML5 development:

1. Application Cache and its “gotchas” that make it useful but annoying
2. Service Workers and how they are fantastic but still not fully compliant across browsers
3. In-browser storage and how, with localStorage, it's easy to implement as a key-value store

From this point, I got on to the subject of Local Databases.

Local Databases were a big focus of my talk. During this section I described the current ecosystem of browsers and their support for IndexedDB and the deprecated-but-still-widely-used Web SQL. I then looked at how libraries like LocalForage, Dexie and PouchDB abstract the pain of dealing with Local Databases. Going further, I explained the utility of PouchDB and its amazing reproduction of the Apache CouchDB interface, which allows it to work seamlessly with IBM Cloudant and other tools that implement the CouchDB replication protocol. I also encouraged audience participation by using http://elsmore.me/seium-demo/ on stage, a basic chat app that uses PouchDB to demonstrate data sync functionality and offline capabilities.

I received lots of good questions, including the all-important one on “what not to store in PouchDB”. It was a pleasure to be invited to participate in the conference, and I hope I get the opportunity to do so again in the future.

OFFLINE-FIRST IN HEARTBITS HACK

HeartBits was a hackathon organized by the Medical and Informatics faculties of the University to explore how technology could be applied to improve general health. With so many developers who had never been to a hackathon before, they produced a wide range of ideas.

I spoke with many of the attendees and received a fantastic overview of the student doctors' goals — and an even better view of how their engineering teammates designed apps to achieve them. The collaboration between the two disciplines was astonishing.

During the event one team used some of the tools covered in my talk. They built a prototype of an offline Web app called GestaMed to help women manage their health and track medication schedules during pregnancy. The team consisted of four great people:

* Diogo Barroso, Faculty of Engineering of University of Porto (GitHub, LinkedIn)
* João Maia, Faculty of Engineering of University of Porto (GitHub, LinkedIn)
* Sofia Sousa Teles, Faculty of Medicine at the University of Lisbon
* Miguel Mendes, Faculty of Engineering of University of Porto

Medicine info in GestaMed (1 of 2)
Medicine info in GestaMed (2 of 2)

They researched and built a database of medications from textbooks on drug interactions and applied this data to ensure safe consumption during different stages of pregnancy. This database was then imported into IBM Cloudant so that it could be replicated to all the apps. The 24-hour deadline didn't leave much time to devote to native UI development, so they built an Apache Cordova app to allow for cross-platform use. Finally, they also used PouchDB as the local storage to seamlessly sync data with IBM Cloudant.
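The GestaMed code itself isn't shown in this post, but the PouchDB-to-Cloudant replication pattern described above generally comes down to a handful of lines. Here is a hypothetical sketch; the database name, the sample document, and the Cloudant URL are placeholders, not the team's actual setup:

// A rough sketch of the offline-first pattern described above: a local
// PouchDB database that the app reads and writes, kept in sync with a
// remote IBM Cloudant database whenever a connection is available.
// All names and credentials below are placeholders.
var PouchDB = require('pouchdb');

var localDB = new PouchDB('medications');
var remoteDB = new PouchDB('https://USERNAME:PASSWORD@ACCOUNT.cloudant.com/medications');

// Writes go to the local database, so the app keeps working offline
localDB.put({ _id: 'med:paracetamol', trimesterGuidance: ['safe', 'safe', 'safe'] })
  .catch(function (err) { console.error(err); });

// Live, retrying, two-way replication keeps the local and remote copies in step
PouchDB.sync(localDB, remoteDB, { live: true, retry: true })
  .on('change', function (info) { console.log('synced a change', info); })
  .on('error', function (err) { console.error('sync error', err); });

Because PouchDB runs in any modern webview, the same code works unchanged inside an Apache Cordova app, which helps explain why the combination suits a 24-hour hackathon.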
They didn't win, but they did an amazing job. It was brilliant watching them learn new technologies. All in all it was an amazing weekend. Here's to next time. Cheers!","Recap of the SEIUM conference in Braga, Portugal, and its companion HeartBits hackathon. Offline-first Web apps were a big topic.",SEIUM Conference and HeartBits Hackathon,Live,245 708,"OFFLINE VERSE

Bradley Holt / May 26, 2016

A screenshot of IBM Verse, our web-based email and calendaring software.

One of my areas of focus as a Developer Advocate here at IBM Cloud Data Services is Offline First, an approach to building web and mobile apps in which the app is designed to work in the most resource-constrained environment first, with progressive enhancement then applied to take advantage of network connectivity when available. I spoke with Yingle Jia (Senior Software Engineer, IBM Verse and IBM Notes) about the offline capabilities recently added to IBM Verse, our web-based business email and calendaring software.

Bradley: Verse is IBM's new web-based email client. For those who aren't familiar with Verse, can you give us a brief overview of Verse and why it was created?

Yingle: Yes, sure. IBM Verse is a cloud-based business email and calendaring offering. It is email reimagined for a new way to work, not just another email client. From the beginning, IBM Verse was created to employ innovative user-centric design, advanced search and social analytics to help users quickly find and focus on the things that are important to them.

Bradley: My manager was recently using Verse and he noticed the new ""offline settings"" section. What sorts of offline capabilities does Verse have? For example, can I read and respond to email while offline?

Yingle: Thanks for trying! We designed Verse offline to be a complement to the Verse online experience. For the initial offline GA, we support synchronization of 7 days of mail in all folders, 7 days of preceding calendar events, and 30 days of future events. Common email operations like reading, composing, saving, sending email, and moving to folders are supported while offline. Security is important for business email, so we encrypt the offline storage. Moreover, we are committed to continuously improving the offline capabilities and user experience over time.

Bradley: What was the motivation for building offline capabilities into Verse?
Yingle: The web-based approach allows us to quickly roll out new features and bug fixes, however, our customers, including IBM itself, made it clear that they need to be able to access Verse while offline, for example when on an air plane or at a customer site where network access is not available or limited. Also, caching data locally can greatly improve user experience, even when the user is connected. Caching is important for cloud-based offerings! Bradley: What browser features were required to make Verse work offline? Have you encountered any browser compatibility issues? Yingle: Verse offline is built upon standard web technologies and we support all major browsers which are supported for online. Technologies being used include IndexedDB, WebCrypto, Web Workers, etc. We did encounter a couple of browser compatibility issues, and reported defects to the corresponding browser vendors. Of course, we tried hard to avoid browser specific code, and the majority ( 99%) of our code is optimized to run well in all major browsers. Bradley: Were the offline capabilities in Verse added from the beginning? If not, were there any challenges with adding offline capabilities after the initial development of Verse? Yingle: We officially started offline support work after the initial Verse GA. MVC design pattern is heavily used in Verse from the beginning, which makes it easier to add offline support without major architectural changes. Of course, it is still a big challenge to add offline support, since we cannot stop the agile development of Verse to add offline support, and we have a dozen teams working on Verse development! Bradley: I recently spoke with the development team at The Weather Company (a recent IBM acquisition). They have put significant efforts into developing a Progressive Web App. Have you considered taking a Progressive Web App approach to Verse? Yingle: Yes, definitely, we are deeply interested in leveraging new web technologies and programming patterns to improve Verse! A big thanks to Yingle Jia for taking the time to talk with me about the offline capabilities in IBM Verse! If you’re interested in getting more involved in the Offline First movement then please consider joining us for Offline Camp , a three day retreat (June 24-27 th ) in the Catskill Mountains . Offline Camp will be a small gathering of about 30 developers, designers, and others interested in furthering the Offline First movement. SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Tagged: IBM Verse / Offline First / Progressive Web Apps Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Geospatial * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM Privacy IBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! 
Email check failed, please try again Sorry, your blog cannot share posts by email.","I spoke with Yingle Jia (Senior Software Engineer, IBM Verse and IBM Notes) about the offline capabilities recently added to IBM Verse, our web-based business email and calendaring software.",Offline Verse,Live,246 711,"Compose The Compose logo Articles Sign in Free 30-day trialDOCUMENT VALIDATION IN MONGODB BY EXAMPLE Published Feb 16, 2017 mongodb developing Document Validation in MongoDB By ExampleIn this article, we'll explore MongoDB document validation by example using an invoice application for a fictitious cookie company. We'll look at some of the different types of validation available in MongoDB, and provide a practical working example of validations in action. Document validation was introduced in MongoDB 3.2 and defines a new way for developers to control the type of data being inserted into their MongoDB instances. We like to show rather than tell so we'll use a practical example to demonstrate basic validations and the commands used to add them to MongoDB. DOCUMENT VALIDATION IN A NUTSHELL Document databases are a flexible alternative to the pre-defined schemas of relational databases. Each document in a collection can have a unique set of fields, and those fields can be added or removed from documents once they are inserted which makes document databases, and MongoDB in particular, an excellent way to prototype applications. However, this flexibility is not without cost and the most underestimated cost is that of predictability. Since the data fields stored in a document can be changed for each document in a collection, developers lose the ability to make assumptions about the data stored in a collection. This can have major implications in your applications - if a transaction in a finance application was inserted with the wrong fields it could throw off calculations and reports that are vital to business. Developers accustomed to relational databases recognize the importance of predictability in data formats, and that's one of the reasons that validation was introduced in MongoDB 3.2. Let's see how document validation works by making an application that uses it. CREATING THE DATA MODELS As a demonstration by example, we're going to create a fictitious cookie company. We'll use this as an example since the entities in this kind of business can be generalized to apply to other businesses. In this case, we'll simply to 3 main data entities: 1. A Customer , which represents a person making a purchase 2. A Product , which represents an item being sold 3. A Transaction , which represents the purchase of a number of products by a customer. Since this is a trivial example, let's build these out in minimal form. In practice, you can make your data entities as complex as you need. CUSTOMER The Customer entity represents someone making a purchase, so we'll include some data typically found in a Customer entity. A typical customer entity might look like the following: { ""id"": ""1"", ""firstName"": ""Jane"", ""lastName"": ""Doe"", ""phoneNumber"": ""555-555-1212"", ""email"": ""Jane.Doe@compose.io"" } Once we know what properties we'll want to include, we'll need to determine what types of validations we'd like to do on this entity. The first step in adding validation is to figure out exactly what we'd like to validate. 
We can validate any of the fields in a collection and can validate based on the existence of a field, data type and format in that field, values in a field, and correlations between two fields in a document. In the case of the Customer entity, we'd like to validate the following: * firstName , lastName , phoneNumber and email are all required to exist * phoneNumber is inserted in a specific format (123-456-7890) * email exists (we won't validate email format for now) We can represent these validations in an intermediate format (before putting them into the database) using the JSONSchema spec. While JSONSchema isn't a necessary step to do validations in MongoDB, it's helpful to codifying our rules in a standard format and JSONSchema is quickly gaining traction for doing server-side validations. { ""$schema"": ""http://json-schema.org/draft-04/schema#"", ""type"": ""object"", ""properties"": { ""id"": { ""type"": ""string"" }, ""firstName"": { ""type"": ""string"" }, ""lastName"": { ""type"": ""string"" }, ""phoneNumber"": { ""type"": ""string"", ""pattern"": ""^([0-9]{3}-[0-9]{3}-[0-9]{4}$"" }, ""email"": { ""type"": ""string"" } }, ""required"": [ ""id"", ""firstName"", ""lastName"", ""phoneNumber"", ""email"" ] } Using JSONSchema also allows us to re-use validations on the application side as well, such as RESTHeart's JSONSchema validation . PRODUCT Just as we did above with the Customer entity, let's take a look at what an example Product entity might contain: { ""id"": ""1"", ""name"": ""Chocolate Chip Cookie"", ""listPrice"": 2.99, ""sku"": 555555555, ""productId"": ""123abc"" } We'll also codify our validations in JSONSchema format as well: { ""$schema"": ""http://json-schema.org/draft-04/schema#"", ""type"": ""object"", ""properties"": { ""id"": { ""type"": ""string"" }, ""name"": { ""type"": ""string"" }, ""listPrice"": { ""type"": ""number"" }, ""sku"": { ""type"": ""integer"" }, ""productId"": { ""type"": ""string"" } }, ""required"": [ ""id"", ""name"", ""listPrice"", ""sku"", ""productId"" ] } TRANSACTION The last entity we'll use in our fictitious cookie shop is a transaction. A transaction represents a single purchase of one or more products by one customer (many-to-one relationship). An inserted transaction record might look like the following: { ""id"": ""1"", ""productId"": ""1"", ""customerId"": ""1"", ""amount"": 20.00 } Lastly, we'll codify the validations we want in JSONSchema format: { ""$schema"": ""http://json-schema.org/draft-04/schema#"", ""type"": ""object"", ""properties"": { ""id"": { ""type"": ""string"" }, ""productId"": { ""type"": ""string"" }, ""customerId"": { ""type"": ""string"" }, ""amount"": { ""type"": ""number"" } }, ""required"": [ ""id"", ""productId"", ""customerId"", ""amount"" ] } Now that we have the structure and validation rules for our application, let's add these validation rules to our Mongo database. ADDING VALIDATION RULES Now that we have an idea of how we want to validate our data, let's add those validation rules to a MongoDB collection. First, let's spin up a new MongoDB on Compose deployment and create a new database for your cookie shop. Be sure to add a database user so we can connect to the database after this step. We'll create a new collection using the mongo command line application, which you can install for your platform . Once you've installed the mongo command line application, created a new database, and added a database user, it's time to create your collection through the mongo command line tool. 
Open a terminal and type the following: mongo mongodb://dbuser:secret@aws-us-east-1-portal.8.dblayer.com:15234/cookieshop This will load up the interactive mongo shell. Now, let's create our collections in the database with the validations we determined earlier. We'll start with the Customer collection: > db.createCollection(""customers"", { validator: { $and: [ { ""firstName"": {$type: ""string"", $exists: true} }, { ""lastName"": { $type: ""string"", $exists: true} }, { ""phoneNumber"": { $type: ""string"", $exists: true, $regex: /^[0-9]{3}-[0-9]{3}-[0-9]{4}$/ } }, { ""email"": { $type: ""string"", $exists: true } } ] } }) We'll leave email validation alone for now since it can be a bit complicated for a trivial example. Next, let's add our products collection and validations: > db.createCollection(""products"", { validator: { $and: [ { ""name"": {$type: ""string"", $exists: true} }, { ""listPrice"": { $type: ""double"", $exists: true} }, { ""sku"": { $type: ""int"", $exists: true} } ] } }) Finally, we'll add our transactions collection which contains a reference to documents in the products and customers collections: db.createCollection(""transactions"", { validator: { $and: [ { ""productId"": {$type: ""objectId"", $exists: true} }, { ""customerId"": { $type: ""objectId"", $exists: true} }, { ""amount"": { $type: ""double"", $exists: true} } ] } }) The objectId type is a special type that allows us to reference documents from other collections. In our case, we'll use it to associate a specific product and user in a transaction. TESTING VALIDATIONS Now, it's time to test our validations to make sure they worked out. We'll start by adding a new customer: db.customers.insertOne({ firstName: ""John"", lastName: ""O'Connor"", phoneNumber: ""555-555-1212"" }); Notice that we've omitted the email field from our user, which was marked as required when we set up our validations. If we set up the validations correctly, we'd expect the insertion to fail which it does: 2017-02-09T12:45:36.714-0800 E QUERY [thread1] uncaught exception: WriteError({ ""index"" : 0, ""code"" : 121, ""errmsg"" : ""Document failed validation"", ""op"" : { ""_id"" : ObjectId(""589cd4f06ca2fef0f7737fb9""), ""firstName"" : ""John"", ""lastName"" : ""O'Connor"", ""phoneNumber"" : ""555-555-1212"" } }) : undefined Once we add the email field to the customer, the validation passes and the new customer is inserted: { ""acknowledged"" : true, ""insertedId"" : ObjectId(""589cd56b6ca2fef0f7737fbc"") } The acknowledge message lets us know that the customer was inserted correctly. Save the insertedId for later as we're going to use it when we make a new transaction. Now, let's add a product and a transaction: db.products.insertOne({ name: ""Chocolate Chip"", listPrice: 2.99, sku: 1 }); Again, make sure to keep track of the insertedId so we can use it while making a transaction. Finally, let's add a transaction in which our new customer purchases our new product: db.transactions.insertOne({ productId: ObjectId(""589cd9216ca2fef0f7737fc4""), customerId: ObjectId(""589cd56b6ca2fef0f7737fbc""), amount: 2.99 }); WRAPPING UP While document validations aren't necessarily desirable in all scenarios, they provide developers with a more robust set of options when deciding where they want to place the responsibility for data integrity within their applications. In this article, we demonstrated how to create collections that have validations in MongoDB to ensure our data has a predictable format and set of data. 
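One follow-on worth knowing, although it isn't covered in the walkthrough above: validation rules aren't fixed at collection-creation time. The sketch below is an illustrative assumption rather than part of the original example; it uses MongoDB's collMod command to replace the customers validator after the fact, here with a deliberately looser rule set.

// Hypothetical example: adjusting an existing validator with collMod.
// validationLevel ""moderate"" applies the rules to inserts and to updates of
// documents that already pass validation; ""strict"" (the default) applies
// them to every insert and update.
db.runCommand({
  collMod: ""customers"",
  validator: { $and: [
    { ""firstName"": { $type: ""string"", $exists: true } },
    { ""lastName"": { $type: ""string"", $exists: true } },
    { ""email"": { $type: ""string"", $exists: true } }
  ] },
  validationLevel: ""moderate""
})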
In the next article, we'll use that predictability with MongoDB aggregations to gain insights into our fictitious business by mining the data in our database.","In this article, we'll explore MongoDB document validation by example using an invoice application for a fictitious cookie company. We'll look at some of the different types of validation available in MongoDB, and provide a practical working example of validations in action.",Document Validation in MongoDB By Example,Live,247 713,Cloudant Query provides you with a declarative way to define and query indexes. This video introduces you to Cloudant Query concepts. Find more videos and tutorials in the Cloudant Learning Center: http://www.cloudant.com/learning-center,Cloudant Query provides you with a declarative way to define and query indexes. This video introduces you to Cloudant Query concepts.,Introducing the new Cloudant query,Live,248 721,"Glynn Bird, Developer Advocate @ IBM Watson Data Platform. Views are my own etc. Jul 18

QUERYING YOUR CLOUDANT DATABASE WITH SQL

UPDATING THE SILVERLINING NODE.JS LIBRARY TO SUPPORT THE BASICS OF SQL

Cloudant and its Apache CouchDB stable-mate are “NoSQL” databases — that is, they are schemaless JSON document stores. Unlike a traditional relational database, you don't need to define your schema before writing data to the database. Just post your JSON to the database and change your mind as often as you like! One of the appealing things about relational databases is the query language. Structured Query Language, or SQL, was developed by IBM in the 1970s and has been widely adopted across a host of databases ever since. In its simplest form, SQL reads like a sentence:
In its simplest form, SQL reads like a sentence: SELECT name, colour, price FROM animalsdb WHERE type='cat' OR (price > 500 AND price < 1000) LIMIT 50 This statement translates to: “Fetch me the name, colour and price from the animals database, but only the rows that are cats, or ones which are more expensive than 500 but cheaper than 1000. And I only want a maximum of 50 rows returned.”It is a convenient way of expressing the fields you want to fetch, the filter you wish to apply to the data, and the maximum number of rows you want in reply. Many databases can store BLOB types , but this isn’t one of those kinds of blobs. Image credit: mark du toit .Unfortunately, NoSQL databases don’t generally support the SQL language. Cloudant and Apache CouchDB™ have their own form of query language where the query is expressed as a JSON object: “ Cloudant Query ” (CQ) and “ Mango ,” in their respective contexts. The CQ or Mango equivalent of the above SQL statement is: It’s a world of curly brackets! If you’re happier expressing your query in SQL, then there is a way. SILVERLINING + SQL The latest version of the silverlining Node.js library can now accept SQL queries. It will convert the SQL into a Cloudant Query and deliver the results. Simply install the Silverlining library: npm install -s silverlining And add it to your Node.js app by passing your Cloudant URL to the library: var db = require('silverlining')('https://USER:PASS@HOST.cloudant.com/animalsdb' We can then start querying our database with an SQL statement: db.query('SELECT name FROM animalsdb').then(function(data) { // data! }); Here are some other sample queries: Silverlining achieves this by converting your SQL query into the equivalent Cloudant Query object. If you’d like to see that data yourself, then call the explain function instead of query to be returned by the query that would have been used: LIMITATIONS Before we get carried away, this feature doesn’t suddenly make Cloudant support joins, unions, transactions, stored procedures etc. It’s just a translation from SQL to Cloudant Query . It doesn’t support aggregations or grouping either, but you can use Silverlining’s count , sum , and stats functions to generate performant grouped aggregation without any fuss. This feature simply makes it easier to explore data sets if you already have SQL language experience. If you enjoyed this article, please ♡ it to recommend it to other Medium readers. * Web Development * JavaScript * Couchdb * Cloudant * Database Blocked Unblock Follow FollowingGLYNN BIRD Developer Advocate @ IBM Watson Data Platform. Views are my own etc. FollowIBM WATSON DATA LAB The things you can make with data, on the IBM Watson Data Platform. * Share * * * * Never miss a story from IBM Watson Data Lab , when you sign up for Medium. Learn more Never miss a story from IBM Watson Data Lab Get updates Get updates","Cloudant and its Apache CouchDB stable-mate are “NoSQL” databases — that is, they are schemaless JSON document stores. 
Unlike a traditional relational database, you don’t need to define your schema…",Querying your Cloudant database with SQL – IBM Watson Data Lab – Medium,Live,249 722,"Homepage IBM Watson Data Lab Follow Sign in Get started * Home * Web Dev * Serverless * Data Science * Object Storage * Containers * Mark Watson Blocked Unblock Follow Following Developer Advocate, IBM Watson Data Platform Oct 18, 2017 -------------------------------------------------------------------------------- BUILDING YOUR FIRST MACHINE LEARNING SYSTEM TRAIN YOUR MODEL AND DEPLOY IT, WATSON ML FOR DEVELOPERS (PART 2) In Part 1 I gave you an overview of machine learning, discussed some of the tools you can use to build end-to-end ML systems, and the path I like to follow when building them. In this post we are going to follow this path to train a machine learning model, deploy it to Watson ML, and run predictions against it in real time. Look Ahead: In Part 3 we’ll create a small web application and backend to demonstrate how you can integrate Watson ML and make machine learning predictions in an end-user application. The Model Cafe in the Allston neighborhood of Boston. Image: Toby McGuire .We are going to use our small data set from Part 1 because the point of this post is to get something up and running quickly — not to actually build an accurate system for making predictions. Here’s the data set: Square Feet # Bedrooms Color Price ----------- ---------- ----- ----- 2,100 3 White $100,000 2,300 4 White $125,000 2,500 4 Brown $150,000 In Part 1 I talked about the tools I use to build machine learning systems. Before we start building our ML system, let’s setup our tools. TOOL SETUP Bluemix/DSX: You’ll need a Bluemix and Data Science Experience account. If you don’t have one, go to https://datascience.ibm.com to sign up. This will create a single account where you can access Bluemix and DSX. Watson Machine Learning: You’ll need an instance of Watson Machine Learning. You can provision a new instance here . Apache Spark™: You’ll need a Spark instance, but if you don’t have one now you can create one later. Now that you’re all set up, let’s follow the process I outlined in Part 1. STEP 1: IDENTIFY WHAT YOU WANT TO PREDICT AND THE SOURCE OF YOUR DATA We’ve identified that we want to predict house prices, and the data set we want to use to drive those predictions. I have made the data set available on GitHub: https://raw.githubusercontent.com/markwatsonatx/watson-ml-for-developers/master/data/house-prices.csv This URL is important because we’ll need to pull this data into our Jupyter Notebook in the next step. STEP 2: CREATE A JUPYTER NOTEBOOK — IMPORT, CLEAN, AND ANALYZE THE DATA CREATE A JUPYTER NOTEBOOK We’re going to analyze our data in a Jupyter Notebook in the IBM Data Science Experience. Jupyter Notebooks are documents that run in a web browser and are composed of cells. Cells can contain markup or executable code. We’ll be coding in Python. I’ll show you how we can import and analyze our data with just three lines of code. -------------------------------------------------------------------------------- Download the following notebook to your computer: https://dataplatform.ibm.com/analytics/notebooks/3e83ffa1-f52a-4b76-bbb5-498b6b7f9505/view?access_token=a7dfdd01dbc24c53a5ac9688fbdd32da1b59156117d721fe10d12660f18dd591 Open DSX and create a new project called “Watson ML for Developers”. From here, create a new Spark instance for it. In the project navigate to Analytic assets and click New notebook . Choose From file . 
Specify a name, like “House Prices”, and choose the notebook you downloaded above. Finally, click Create Notebook . You should be taken directly to edit the notebook. -------------------------------------------------------------------------------- If this is your first time using Jupyter notebooks here are a few tips that you may find helpful (if you are already familiar with Jupyter notebooks, feel free to skip ahead): 1. Always make sure your kernel is running. You should see the status of your kernel in the top right. 2. If your kernel is not running, you can restart it from the Kernel menu. From here you can also interrupt your kernel, or change your kernel (if you want to use a different version of Python or Apache Spark). 3. A notebook is made of up markup and code cells. You can walk through the notebook and execute the code cells by clicking the run button in the toolbar or from the Cell menu. -------------------------------------------------------------------------------- IMPORT, CLEAN, AND ANALYZE THE DATA Let’s look at the first three code cells in the notebook where we will load and analyze our data. Here’s the first code cell: import pixiedust This cell just imports a Python library called PixieDust . PixieDust is an open source helper library that works as an add-on to Jupyter Notebooks that makes it easy to import and visualize data. In the second cell we load our sample data: df = pixiedust.sampleData(""https://raw.githubusercontent.com/markwatsonatx/watson-ml-for-developers/master/data/house-prices.csv"") This will generate a Spark DataFrame called “df”. A DataFrame is a data set organized into named columns. You can think of it as a spreadsheet, or a relational database table. The Spark ML API uses DataFrames to train and test ML models. Finally, we’ll call the display function in PixieDust to display our data: display(df) It should look something like this: In this case we are displaying a simple table, but PixieDust also provides graphs and charts for helping you understand and analyze your data without writing any code. In just three lines of code we have imported and analyzed our data set. Now it’s time to do some machine learning! STEP 3: USE APACHE SPARK ML TO BUILD AND TEST A MACHINE LEARNING MODEL BUILD A MACHINE LEARNING MODEL We’re going to build our first ML model in just a handful of cells. To start we need to import the Spark ML libraries that we’ll be using: from pyspark.ml import Pipeline from pyspark.ml.regression import LinearRegression from pyspark.ml.feature import VectorAssembler This is a regression problem (we’re trying to predict a real number), so we are going to use the LinearRegression algorithm in pyspark.ml.regression . There are other regression algorithms, but those are outside of the scope of this post. We are going to build our ML model in just four lines of code. These four lines are in a single cell in our notebook, like so: assembler = VectorAssembler( inputCols=['SquareFeet','Bedrooms'], outputCol=""features"" ) lr = LinearRegression(labelCol='Price', featuresCol='features') pipeline = Pipeline(stages=[assembler, lr]) model = pipeline.fit(df) Let’s break this down, line by line. First of all we need to specify our features . In the previous post we decided that we would use Square Feet and # Bedrooms as our features. 
Our ML algorithm expects a single vector of feature columns, so here we use a VectorAssembler to tell our ML pipeline (we’ll talk about pipelines in a minute) that we want SquareFeet and Bedrooms as our features: assembler = VectorAssembler( inputCols=['SquareFeet','Bedrooms'], outputCol=""features"" ) Next, we create an instance of LinearRegression , the ML algorithm we are going to use. At a minimum, you must specify the features and the labels. There are other parameters you can provide to tweak the algorithm, but they’re not going to do us much good when working with three data points :) lr = LinearRegression(labelCol='Price', featuresCol='features') Next, we create our pipeline . A Pipeline allows us to specify the steps that should be performed when training an ML model. In this case, we first want to assemble our two feature columns into a single vector — that’s the assembler. Then we want to run it through our LinearRegression algorithm. In upcoming posts I’ll discuss other operations that you’ll run through the pipeline — like converting non-numeric data to numeric data. pipeline = Pipeline(stages=[assembler, lr]) Finally, we pass our DataFrame to the fit method on the pipeline to create our ML model. model = pipeline.fit(df) Congratulations, you now have a machine learning model that you can use to predict house prices! TEST THE MODEL It’s time to test our model. In our example we are going to run a single prediction. In future posts I’ll discuss how you can analyze the accuracy of your model by running a large number of predictions based on your original data set. Here we create a Python function to get our prediction: def get_prediction(square_feet, num_bedrooms): request_df = spark.createDataFrame( [(square_feet, num_bedrooms)], ['SquareFeet','Bedrooms'] ) response_df = model.transform(request_df) return response_df Let’s break this cell down. First of all, in order to generate a prediction against an ML model generated using Spark ML, we need to pass it a DataFrame with the data we want to use in our prediction (i.e., the square footage and # bedrooms for the house price we want to predict). This line of code creates the DataFrame we’ll pass to our model: request_df = spark.createDataFrame( [(square_feet, num_bedrooms)], ['SquareFeet','Bedrooms'] ) Then we’ll call transform on the model, passing in the request DataFrame. This returns another DataFrame: response_df = model.transform(request_df) Let’s run a prediction for a house that is 2,400 square feet and has 4 bedrooms: response = get_prediction(2400, 4) response.show() The result is a DataFrame that looks like this: +----------+--------+------------+------------------+ |SquareFeet|Bedrooms| features| prediction| +----------+--------+------------+------------------+ | 2400| 4|[2400.0,4.0]|137499.99999999968| +----------+--------+------------+------------------+ Tip: You can use PixieDust to visualize any DataFrame, including this one. If you’ve imported PixieDust and you have a DataFrame, display() is your friend :)Our ML model returned back our features along with a prediction. In this case, it predicted that a house that is 2,400 square feet and has 4 bedrooms should have a price of about $137,500, which is directly in between our 2,300 square foot house and our 2,500 square foot house. STEP 4: DEPLOY AND TEST THE MODEL WITH WATSON ML DEPLOY THE MODEL We’ve trained and tested our machine learning model, but if we want to predict house prices from a web or mobile app it’s not going to do us much good in this notebook. 
That’s where Watson ML comes in. In the same notebook, we’re going to deploy this model to Watson ML and create a “scoring endpoint”, or a REST API for making predictions. The first thing you’ll need to do is specify your Watson ML credentials. You can find your credentials by going to the Watson ML in Bluemix and clicking Service Credentials on the left ( head to the catalog to deploy it ): Fill in the following cell with your credentials: service_path = 'https://ibm-watson-ml.mybluemix.net' username = 'YOUR_WML_USER_NAME' password = 'YOUR_WML_PASSWORD' instance_id = 'YOUR_WML_INSTANCE_ID' model_name = 'House Prices Model' deployment_name = 'House Prices Deployment' The next cell initializes some libraries for connecting to Watson ML. These libraries are built into DSX: from repository.mlrepositoryclient import MLRepositoryClient from repository.mlrepositoryartifact import MLRepositoryArtifact ml_repository_client = MLRepositoryClient(service_path) ml_repository_client.authorize(username, password) Next, we’ll use the same libraries to save our model to Watson ML. We pass the trained model, our data set, and a name for the model — in this case we’re calling it “House Prices Model”: model_artifact = MLRepositoryArtifact( model, training_data=df, name=model_name ) saved_model = ml_repository_client.models.save(model_artifact) model_id = saved_model.uid The call to save the model returns an object that we store in our saved_model variable from which we extract the unique ID for the model. This is important as it will be used later to create a deployment for the model. We now have a trained machine learning model that we have deployed to Watson ML, but we still don’t have a way to access it. The next few cells will do just that. We are going to create a Deployment for our ML model. To do this, we are going to use the Watson ML Rest API . The Watson ML Rest API uses token-based authentication, so our first step is to generate a token using our Watson ML credentials: headers = urllib3.util.make_headers( basic_auth='{}:{}'.format(username, password) ) url = '{}/v3/identity/token'.format(service_path) response = requests.get(url, headers=headers) ml_token = 'Bearer ' + json.loads(response.text).get('token') Now we can create our deployment. Here we make an HTTP POST to the published_models/deployments endpoint — passing in our Watson ML instance_id and the model_id of our newly saved model. deployment_url = service_path + ""/v3/wml_instances/"" + instance_id + ""/published_models/"" + model_id + ""/deployments/"" deployment_header = { 'Content-Type': 'application/json', 'Authorization': ml_token } deployment_payload = { ""type"": ""online"", ""name"": deployment_name } deployment_response = requests.post( deployment_url, json=deployment_payload, headers=deployment_header ) scoring_url = json.loads(deployment_response.text) .get('entity') .get('scoring_url') print scoring_url The last line above prints the scoring_url parsed from the response received from Watson ML. This is an HTTP endpoint that we can use to make predictions. You now have a deployed machine learning model that you can use to predict house prices from anywhere! You can call it from a front-end application, your middleware, or from a notebook — we’ll do just that next :) TEST THE MODEL For now, we’re going to test our Watson ML deployment from our notebook, but the real value of deploying your ML models to Watson ML is that you can run predictions from anywhere. 
In the notebook I created a new function called get_prediction_from_watson_ml . Just like the last function, this one takes the square footage and the number of bedrooms for the house price you would like to predict. Rather than calling the Spark ML APIs, you can see that this function performs an HTTP POST to the scoring_url we received earlier. def get_prediction_from_watson_ml(square_feet, num_bedrooms): scoring_header = { 'Content-Type': 'application/json', 'Authorization': ml_token } scoring_payload = { 'fields': ['SquareFeet','Bedrooms'], 'values': [[square_feet, num_bedrooms]] } scoring_response = requests.post( scoring_url, json=scoring_payload, headers=scoring_header ) return scoring_response.text Let’s run the same prediction we ran earlier — a house that is 2,400 square feet and has 4 bedrooms: response = get_prediction_from_watson_ml(2400, 4) print response The call to our Watson ML REST API returned our features along with the same prediction we received when we ran our test using Spark ML and the local ML model that we generated. { ""fields"": [""SquareFeet"", ""Bedrooms"", ""features"", ""prediction""], ""values"": [[2400, 4, [2400.0, 4.0], 137499.99999999968]] } NEXT STEPS In this post we built an end-to-end machine learning system using the IBM Data Science Experience, Spark ML, and Watson ML. In just a few lines of code, we imported and visualized a data set, built an ML pipeline and trained an ML model, and made that model available to make predictions from software running anywhere. Although we barely scratched the surface of machine learning, I hope this article gave you a basic understanding of how to build an ML system. In the next post, I will show you how to consume the Watson ML scoring endpoint from an end-user application. In future posts, I will slowly venture deeper into machine learning with working examples for common ML problems: supervised and unsupervised, binary and multiclass classification, clustering, and more. * Machine Learning * Pixiedust * Data Science * Jupyter Notebook * Cognitive Computing One clap, two clap, three clap, forty?By clapping more or less, you can signal to us which stories really stand out. 57 Blocked Unblock Follow FollowingMARK WATSON Developer Advocate, IBM Watson Data Platform FollowIBM WATSON DATA LAB The things you can make with data, on the IBM Cloud. * 57 * * * Never miss a story from IBM Watson Data Lab , when you sign up for Medium. Learn more Never miss a story from IBM Watson Data Lab Get updates Get updates","In Part 1, I gave you an overview of machine learning, discussed some of the tools you can use to build end-to-end ML systems, and the path I like to follow when building them. 
In this post we are going to follow this path to train a machine learning model, deploy it to Watson ML, and run predictions against it in real time.",Building Your First Machine Learning System ,Live,250 723,"PODCASTS DATA SCIENCE EXPERT INTERVIEW: DEZ BLANCHFIELD, CRAIG BROWN, DAVID MATHISON, JENNIFER SHIN AND MIKE TAMIR PART 2 November 16, 2016 | 21:02 OVERVIEW The IBM Insight at World of Watson 2016 conference brought together many leading experts in data science, cognitive computing and big data analytics. In this, the second part of a two-part podcast recorded at the conference, IBM data science evangelist James Kobielus interviews five industry thought leaders to gain their insights into the trends facing data professionals: * Dez Blanchfield (The Bloor Group) * Craig Brown (Untapped Potential) * David Mathison (CDO Club) * Jennifer Shin (8 Path Solutions) * Mike Tamir (Intertrust Technologies Corporation) Explore the power that a productivity platform can bring to team data science by learning more about the IBM Watson Data Platform. Listen to part 1 Topics: Analytics , Big Data Technology , Big Data Use Cases , Data Scientists , Hadoop Tags: big data , business analyst , chief data officers , cognitive computing , data analytics , data science , open analytics , predictive analytics
","Take a peek at the future of data science in this discussion with five thought leaders in the data analytics industry, the second installment of a two-part interview recorded at the IBM Insight at World of Watson 2016 conference.","Data science expert interview: Dez Blanchfield, Craig Brown, David Mathison, Jennifer Shin and Mike Tamir part 2",Live,251 724,"* Home * Research * Partnerships and Chairs * Staff * Books * Articles * Videos * Presentations * Contact Information * Subscribe to our Newsletter * 中文 * Marketing Analytics * Credit Risk Analytics * Fraud
Analytics * Process Analytics * Human Resource Analytics * Prof. dr. Bart Baesens * Prof. dr. Seppe vanden Broucke * Aimée Backiel * Sandra Mitrović * Klaas Nelissen * María Óskarsdóttir * Michael Reusens * Eugen Stripling * Tine Van Calster * Basic Java Programming * Principles of Database Management * Business Information Systems * Mini Lecture Series * Other Videos WEB PICKS (WEEK OF 4 SEPTEMBER 2017) Posted on September 9, 2017Every two weeks, we find the most interesting data science links from around the web and collect them in Data Science Briefings , the DataMiningApps newsletter. Subscribe now for free if you want to be the first to get up to speed on interesting resources . * Silicon Valley siphons our data like oil. But the deep drilling has just begun Personal data is to the tech world what oil is to the fossil fuel industry. That’s why companies like Amazon and Facebook plan to dig deeper than we ever imagined. * A Survey of 3,000 Executives Reveals How Businesses Succeed with AI “The next digital frontier is here, and it’s AI.” * Scraping data from the public web may be legal When is it okay to grab data from someone else’s website, without their explicit permission? A new ruling by a federal judge in California might have dramatic implications on this question, and on the open nature of the web in general. * Data Alone Isn’t Ground Truth You should always carry a healthy dose of skepticism in your back pocket. * To Survive in Tough Times, Restaurants Turn to Data-Mining “According to the tech wizards who are determined to jolt the restaurant industry out of its current slump, information culled and crunched from a wide array of sources can identify customers who like to linger, based on data about their dining histories.” * How the GDPR will disrupt Google and Facebook Google and Facebook will be disrupted by the new European data protection rules that are due to apply in May 2018. This note explains how. * Machine Learning for Humans Simple, plain-English explanations accompanied by math, code, and real-world examples. * Why We Need Accountable Algorithms AI and machine learning algorithms are marketed as unbiased, objective tools. They are not * Support Hypothesis In September, Stripe is supporting the development of Hypothesis, an open-source testing library for Python created by David MacIver. Hypothesis is the only project we’ve found that provides effective tooling for testing code for machine learning, a domain in which testing and correctness are notoriously difficult. * Cornea AI aims to predict the popularity of your next photo The Cornea score uses Artificial Intelligence to predict the popularity of your photo. * Logo Rank is an AI system that understands logo design It’s trained on a million+ logo images to give you tips and ideas. It can also be used to see if your designer took inspiration from stock icons. * ggpage Creates Page Layout Visualizations in R * Can CNNs transliterate Pinyin into Chinese characters correctly? This project examines how well neural networks can convert Pinyin, the official romanization system for Chinese, into Chinese characters. * Simulate colorblindness in R figures This new R package provides a variety of functions that are helpful to simulate the effects of colorblindness in R figures. * PyTorch or TensorFlow? 
“This is a guide to the main differences I’ve found between PyTorch and TensorFlow.” * Deep Learning is not the AI future “While Deep Learning had many impressive successes, it is only a small part of Machine Learning, which is a small part of AI. We argue that future AI should explore other ways beyond DL.” ‹ Web Picks (week of 21 August 2017) —Ad—We display ads on this section of the site. -------------------------------------------------------------------------------- Recent Posts * Web Picks (week of 4 September 2017) * Web Picks (week of 21 August 2017) * What discount factor is commonly used in calculating Customer Lifetime Value (CLV)? * Simple Linear Regression? Do It The Bayesian Way * Web Picks (week of 7 August 2017) Archives * September 2017 * August 2017 * July 2017 * June 2017 * May 2017 * April 2017 * March 2017 * February 2017 * January 2017 * December 2016 * November 2016 * October 2016 * September 2016 * August 2016 * July 2016 * June 2016 * May 2016 * April 2016 * March 2016 * February 2016 * January 2016 * December 2015 * November 2015 * October 2015 * September 2015 * August 2015 * July 2015 * June 2015 * May 2015 * * * © DataMiningApps - Data Mining, Data Science and Analytics Research @ LIRIS, KU Leuven KU Leuven, Department of Decision Sciences and Information Management Naamsestraat 69, 3000 Leuven, Belgium DataMiningApps on Twitter , Facebook , YouTube info@dataminingapps.com",Interesting data science links from around the web.,Web Picks (week of 4 September 2017),Live,252 730,"Homepage Follow Sign in / Sign up Homepage * Home * Data Science Experience * * Watson Data Platform * Armand Ruiz Blocked Unblock Follow Following Lead Product Manager Data Science Experience #IBM #BigData #Analytics #RStats #Cloud - Born in Barcelona Living in Chicago - All tweets and opinions are my own Oct 3 -------------------------------------------------------------------------------- LIFELONG (MACHINE) LEARNING: HOW AUTOMATION CAN HELP YOUR MODELS GET SMARTER OVER TIME MACHINE LEARNING SHOULD HAPPEN CONSTANTLY Imagine you’re interviewing a new job applicant who graduated top of their class and has a stellar résumé. They know everything there is to know about the job, and has the skills that your business needs. There’s just one catch: from the moment they join your team, they’ve vowed never to learn anything new again. You probably wouldn’t make that hire, because you know that life long learning is vital if someone is going to add long-term value to your team. Yet when we turn to the field of machine learning, we see companies making a similar mistake all the time. Data scientists work hard to develop, train and test new machine learning models and neural networks. However, once the models get deployed, they don’t learn anything new. After a few weeks or months, become static and stale, and their usefulness as a predictive tool deteriorates. WHY MODELS STOP LEARNING Data scientists are well aware of this problem, and would love to find a way to enable their models to participate in the equivalent of lifelong learning. However, moving a model into production is typically a tough task, and deployment requires help from busy IT specialists. When a single deployment can take weeks, it’s no wonder that most data scientists prefer to hand over their latest model and move onto the next project, rather than persist with the drudgery of continually retraining and redeploying their existing models. Deployment isn’t just painful for data scientists — it can be a headache for IT teams too. 
Data scientists might have used any one of a wide variety of languages, frameworks and tools to build their models, and there is no guarantee that those choices will make the model easy to integrate into production systems. In a worst-case scenario, the model may need to be substantially refactored or even rebuilt from scratch before it can be deployed. As a result, if data scientists ask for their models to be redeployed too frequently, they may be met with significant resistance from the IT department. STREAMLINING DEPLOYMENT TO KEEP MODELS IN TRAINING The good news is that model deployment isn’t inherently labor-intensive. Just as in other forms of software development, the principles of DevOps apply here. With the right platform, it is possible to create seamless continuous deployment pipelines that automate many aspects of the process, transforming deployment from weeks of manual effort to a matter of a few mouse-clicks. For example, with IBM® Watson® Machine Learning integrated in IBM Data Science Experience, data scientists can develop models using a wide range of languages (including Python, R and Scala) and frameworks (such as SparkML, Scikit-Learn, xgboost and SPSS). The solution will abstract the models into a standardized API that can be integrated easily with production systems. This gives data scientists the flexibility they need to choose best-of-breed tools and techniques during development, without increasing the complexity of deployment for the IT team. Watson Machine Learning aims to combine other elements of IBM Watson Data Platform to provide a continuous feedback loop. When your model is ready to move into production, you can specify how frequently you would like to retrain it, and automate the redeployment process. You can also monitor and validate the results of the retrained model to ensure that the new version is an improvement — and with integrated version control, you can easily roll back to the previous release if necessary. GIVING DATA SCIENTISTS MORE POWER These capabilities help to reduce the need for IT teams to act as intermediaries in the deployment process, eliminating the biggest bottleneck for continuous improvement of machine learning models. They also place more power in the hands of data scientists, empowering them to focus on building and maintaining the most accurate models possible, instead of being forced to sacrifice quality for practicality. Most importantly, solutions like Watson Machine Learning give your models the chance to do what they were always meant to do: learn. By continuously retraining your models against the latest data, you can ensure that they continue to reflect today’s business realities, giving your organization the insight it needs to make smarter decisions and seize competitive advantage. Get started with Data Science Experience for free -------------------------------------------------------------------------------- Originally published at www.ibm.com on October 3, 2017.
Blocked Unblock Follow FollowingARMAND RUIZ Lead Product Manager Data Science Experience #IBM #BigData #Analytics #RStats #Cloud - Born in Barcelona Living in Chicago - All tweets and opinions are my own FollowIBM WATSON DATA PLATFORM Build smarter applications and quickly visualize, share, and gain insights * * * * Never miss a story from IBM Watson Data Platform , when you sign up for Medium. Learn more Never miss a story from IBM Watson Data Platform Get updates Get updates","Imagine you’re interviewing a new job applicant who graduated top of their class and has a stellar résumé. They know everything there is to know about the job, and has the skills that your business…",Lifelong (machine) learning: how automation can help your models get smarter over time,Live,253 739,"APPLE, IBM ADD MACHINE LEARNING TO PARTNERSHIP WITH WATSON-CORE ML COUPLING Ron Miller 15 hoursApple and IBM may seem like an odd couple , but the two companies have been working closely together for several years now. That has involved IBM sharing its enterprise expertise with Apple and Apple sharing its design sense with IBM. The companies have actually built hundreds of enterprise apps running on iOS devices. Today, they took that friendship a step further when they announced they were providing a way to combine IBM Watson machine learning with Apple Core ML to make the business apps running on Apple devices all the more intelligent. The way it works is that a customer builds a machine learning model using Watson, taking advantage of data in an enterprise repository to train the model. For instance, a company may want to help field service techs point their iPhone camera at a machine and identify the make and model to order the correct parts. You could potentially train a model to recognize all the different machines using Watson’s image recognition capability. The next step is to convert that model into Core ML and include it in your custom app. Apple introduced Core ML at the Worldwide Developers Conference last June as a way to make it easy for developers to move machine learning models from popular model building tools like TensorFlow, Caffe or IBM Watson to apps running on iOS devices. After creating the model, you run it through the Core ML converter tools and insert it in your Apple app. The agreement with IBM makes it easier to do this using IBM Watson as the model building part of the equation. This allows the two partners to make the apps created under the partnership even smarter with machine learning. “Apple developers need a way to quickly and easily build these apps and leverage the cloud where it’s delivered. [The partnership] lets developers take advantage of the Core ML integration,” Mahmoud Naghshineh, general manager for IBM Partnerships and Alliances explained. To make it even easier, IBM also announced a cloud console to simplify the connection between the Watson model building process and inserting that model in the application running on the Apple device. Over time, the app can share data back with Watson and improve the machine learning algorithm running on the edge device in a classic device-cloud partnership. “That’s the beauty of this combination. As you run the application, it’s real time and you don’t need to be connected to Watson, but as you classify different parts [on the device], that data gets collected and when you’re connected to Watson on a lower [bandwidth] interaction basis, you can feed it back to train your machine learning model and make it even better,” Naghshineh said. 
The point of the partnership has always been to use data and analytics to build new business processes, by taking existing approaches and reengineering them for a touch screen. “This adds a level of machine learning to that original goal moving it forward to take advantage of the latest tech. “We are taking this to the next level through machine learning. We are very much on that path and bringing improved accelerated capabilities and providing better insight to [give users] a much greater experience,” Naghshineh said.",Apple and IBM announce they were providing a way to combine IBM Watson machine learning with Apple Core ML to make the business apps running on Apple devices all the more intelligent.,"Apple, IBM add machine learning to partnership with Watson-Core ML coupling",Live,254 742,"REDIS PUBSUB, NODE, AND SOCKET.IO Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published Nov 10, 2016Sockets are the high power pipeline of the realtime web and in this article we'll show how a minimal amount of code can bring database data to life in a web browser. With the rise of bots and the chat based tools such as Slack and Messenger, users today have come to expect much more immediate interactions from their applications. One of the tools that most front end developers should have in their toolbox today is socket based communication. With a socket based solution it is easy to deliver realtime updating like leaderboards, stock quotes, tweets or any other streaming style of data to both mobile and web applications. Here we will look at using just such a set of tools with NodeJS and Socket.io on both the server and in the browser. And we will complement them with a Redis PubSub implementation to model interacting with backend services and Smoothie.js to finish off the front end with a visualization. We'll use tweets as an example but it is easy to substitute any kind of realtime data you may have available. NODEJS AND SOCKET.IO We need three things on the server side. First something to serve a web page since that in essence is our front end application. ExpressJS works just fine: var express = require('express'); var app = express(); var http = require('http').Server(app); app.use('/', express.static('www')); http.listen(8000, function(){ console.log('listening on *:8000'); }); Above we setup Node as an HTTP server to deliver our web page (application) which is just some static assets in the www directory. Second we create our server's socket infrastructure with Socket.io: var io = require('socket.io')(http); Seriously, that's it for setting it up. We haven't sent any messages yet, nor received any, but the infrastructure is now in place. And what is most interesting is that this will work over most current infrastructure because it starts with long-polling and then upgrades the connection to an actual websocket. So, you can use the socket model even without sockets currently. See engine-io for details. Third, we'll include a Redis subscription and wire up broadcasting an actual Socket.io message: var redis = require('redis'); var url = config.get('redis.url'); var client1 = redis.createClient(url); var client2 = redis.createClient(url); client1.on('message', function(chan, msg) { client2.hgetall(msg, function(err, res) { res.key = msg; io.sockets.emit(res); }); }); client1.subscribe('yourChannelName'); We use two Redis connections. 
client1 handles the PubSub subscription while client2 actually gets the hash for the key that came through the subscription (it is be possible to remove the second connection and push all of the data through the PubSub channel too). Then with io.sockets.emit(res); we broadcast all of the data to any connected clients. We've left out the Redis publish side of above but it really isn't any more complicated than reversing what we've shown: client.publish(""yourChannelName"", msg); As you can see the simplicity of this highlights how effective Node is as an event based networking tool. Next we'll move on to the client side which listens for the broadcast. WEB BROWSER AND SOCKET.IO As you might have guessed the browser side of Socket.io is pretty easy too. So, with the assumption that an html page has been delivered to your browser via Express and your Node server then the following sock.on() will be called every time a broadcast message is emitted from your server. The beauty here is that the Socket.io library defaults to contacting the same server which delivered the page. That little bit of script is perfect for handing a continuous stream of events off to a realtime charting tool. While there are lots of JavaScript charting libraries, one of the easiest for this style of data is Smoothie.js . To use it set up a tag in the body of an html page and then you can attach the chart and stream the data to it. The JavaScript to wire all of the charting up, attach it to the canvas, and stream follows: function createGraphOnPageLoad() { var sock = io(); var smoothie = new SmoothieChart(); smoothie.streamTo(document.getElementById('twits')); var redLine = new TimeSeries(); var blueLine = new TimeSeries(); smoothie.addTimeSeries(redLine,{ strokeStyle:'rgb(255, 0, 0)', lineWidth:3 } ); smoothie.addTimeSeries(blueLine, { strokeStyle:'rgb(0, 0, 255)', lineWidth:3 }); sock.on('twits', function(msg) { var at = new Date().getTime(); var reach = msg.reach * 1; if(msg.category == ""Red"") { redLine.append(at, reach); } else { blueLine.append(at, reach); } }); } The above function should be called after the page loads which ensures that the canvas element is already created. It creates the socket, creates the chart and wires it to the canvas. Then it creates two timeSeries. The messages actually represent tweets and the reach is the number of people who receive the tweet. The Redis PubSub actually transports both red and blue categorized tweets. The timelines represent how many followers could see the tweet on the y-axis and time on the x-axis. Red and blue are categories of twitter searches for comparison. It adds them to the chart at which point it waits for the events which are actually tweets and then it appends them. On each append the chart is updated with the reach metric and inserted at the current time. The web page output looks like this: While it is a simple charting solution it does a good job of showing the value of the full chain of soft realtime data via Node, Redis, and Socket.io. To view the code example on github go here . -------------------------------------------------------------------------------- If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com . We're happy to hear from you. Image by William Iven Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Hays Hutton writes code and then writes about it. Love this article? Head over to Hays Hutton’s author page and keep reading. 
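As a closing addendum, here is the publishing side's contract with the subscriber above: write the hash fields the subscriber expects (category and reach) at some key, then publish that key on the channel so connected subscribers know to fetch it. The article's own stack is Node, where the publish itself is the one-liner shown earlier; the sketch below expresses the same flow in Go with the go-redis client purely as an illustration. The key name tweet:1234, the field values, and the v6-era method signatures (no context arguments) are my assumptions, not part of the original example.

package main

import (
	"log"

	"github.com/go-redis/redis" // assumption: go-redis v6-style API without context arguments
)

func main() {
	client := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Store the full record as a hash; the subscriber fetches it later with HGETALL.
	key := "tweet:1234" // hypothetical key name
	if err := client.HSet(key, "category", "Red").Err(); err != nil {
		log.Fatal(err)
	}
	if err := client.HSet(key, "reach", 42).Err(); err != nil {
		log.Fatal(err)
	}

	// Publish only the key on the channel; subscribers receive the key and read the hash themselves.
	if err := client.Publish("yourChannelName", key).Err(); err != nil {
		log.Fatal(err)
	}
}

Keeping the published message down to a key keeps the pub/sub traffic small and means a subscriber always reads the latest state of the hash rather than whatever happened to be in the message.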
",Sockets are the high power pipeline of the realtime web and in this article we'll show how a minimal amount of code can bring database data to life in a web browser.,"Redis PubSub, Node, and Socket.io",Live,255 743,"XML2 1.0.0 July 5, 2016 in Packages We are pleased to announce that xml2 1.0.0 is now available on CRAN. Xml2 is a wrapper around the comprehensive libxml2 C library, and makes it easy to work with XML and HTML files in R. Install the latest version with: install.packages(""xml2"") There are three major improvements in 1.0.0: 1. You can now modify and create XML documents. 2. xml_find_first() replaces xml_find_one(), and provides better semantics for missing nodes. 3. Improved namespace handling when working with XPath. There are many other small improvements and bug fixes: please see the release notes for a complete list. MODIFICATION AND CREATION xml2 now supports modification and creation of XML nodes. This includes new functions xml_new_document(), xml_new_child(), xml_new_sibling(), xml_set_namespace(), xml_remove(), xml_replace(), xml_root(), and replacement methods for xml_name(), xml_attr(), xml_attrs() and xml_text(). The basic process of creating an XML document by hand looks something like this: root <- xml_new_document() %>% xml_add_child(""root"") root %>% xml_add_child(""a1"", x = ""1"", y = ""2"") %>% xml_add_child(""b"") %>% xml_add_child(""c"") %>% invisible() root %>% xml_add_child(""a2"") %>% xml_add_sibling(""a3"") %>% invisible() cat(as.character(root)) #> <?xml version=""1.0""?> #> <root><a1 x=""1"" y=""2""><b><c/></b></a1><a2/><a3/></root> For a complete description of creation and mutation, please see vignette(""modification"", package = ""xml2""). XML_FIND_FIRST() xml_find_one() has been deprecated in favor of xml_find_first(). xml_find_first() now always returns a single node: if there are multiple matches, it returns the first (without a warning), and if there are no matches, it returns a new xml_missing object. This makes it much easier to work with ragged/inconsistent hierarchies:
This makes it much easier to work with ragged/inconsistent hierarchies: x1 <- read_xml("" See Sea "") c <- x1 %>% xml_find_all("".//b"") %>% xml_find_first("".//c"") c #> {xml_nodeset (3)} #> [1] #> [2] See #> [3] Sea Missing nodes are replaced by missing values in functions that return vectors: xml_name(c) #> [1] NA ""c"" ""c"" xml_text(c) #> [1] NA ""See"" ""Sea"" XPATH AND NAMESPACES XPath is challenging to use if your document contains any namespaces: x <- read_xml(' ') x %>% xml_find_all("".//baz"") #> {xml_nodeset (0)} To make life slightly easier, the default xml_ns() object is automatically passed to xml_find_*() : x %>% xml_ns() #> d1 <-> http://foo.com #> d2 <-> http://bar.com x %>% xml_find_all("".//d1:baz"") #> {xml_nodeset (1)} #> [1] If you just want to avoid the hassle of namespaces altogether, we have a new nuclear option: xml_ns_strip() : xml_ns_strip(x) x %>% xml_find_all("".//baz"") #> {xml_nodeset (2)} #> [1] #> [2] SHARE THIS: * Reddit * More * * Email * Facebook * * Print * Twitter * * LIKE THIS: Like Loading...RELATED SEARCH LINKS * Contact Us * Development @ Github * RStudio Support * RStudio Website * R-bloggers CATEGORIES * Featured * News * Packages * R Markdown * RStudio IDE * Shiny * shinyapps.io * Training * Uncategorized ARCHIVES * July 2016 * June 2016 * May 2016 * April 2016 * March 2016 * February 2016 * January 2016 * December 2015 * October 2015 * September 2015 * August 2015 * July 2015 * June 2015 * May 2015 * April 2015 * March 2015 * February 2015 * January 2015 * December 2014 * November 2014 * October 2014 * September 2014 * August 2014 * July 2014 * June 2014 * May 2014 * April 2014 * March 2014 * February 2014 * January 2014 * December 2013 * November 2013 * October 2013 * September 2013 * June 2013 * April 2013 * February 2013 * January 2013 * December 2012 * November 2012 * October 2012 * September 2012 * August 2012 * June 2012 * May 2012 * January 2012 * October 2011 * June 2011 * April 2011 * February 2011 EMAIL SUBSCRIPTION Enter your email address to subscribe to this blog and receive notifications of new posts by email. Join 19,744 other followers RStudio is an affiliated project of the Foundation for Open Access Statistics 1 COMMENT July 10, 2016 at 2:04 am Petites choses pour les vacances | Polit’bistro : des politiques, du café […] Pour les nerds qui aiment la programmation statistique et le Web, il y a le XML, et pour le XML avec R, il y a xml2, désormais en version 1.0.0. […] « Join us at rstudio::conf 2017! httr 1.2.0 »Blog at WordPress.com. The Tarski Theme . Subscribe to feed. FollowFOLLOW “RSTUDIO BLOG” Get every new post delivered to your Inbox. Join 19,744 other followers Build a website with WordPress.com Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email. %d bloggers like this:","We are pleased to announced that xml2 1.0.0 is now available on CRAN. Xml2 is a wrapper around the comprehensive libxml2 C library, and makes it easy to work with XML and HTML files in R. Install t…",xml2 1.0.0,Live,256 751,"After visiting a trade show and seeing a succession of dull stands with only leaflets to hand out, Chris Snow and I came up with the idea of building a Cloudant cluster of Raspberry Pis to put at IBM Cloudant’s booth. 
Cloudant’s NoSQL Database-as-a-Service clusters are hidden away in the depths of data centres around the world belonging to SoftLayer, Rackspace, Microsoft and Amazon, so there is little tangible product to display at a conference stand; no software to install, no drivers required, no sql! A working Cloudant cluster at the booth allows distributed databases to be seen in action with flashing lights indicating per-node activity.I started by building the Developer Preview of CouchDB 2.0 on a single Raspberry Pi running Debian Wheezy to make sure it was feasible. It wasn’t as simple as typing “sudo apt-get install couchdb” - that only gets you CouchDB 1.2. I needed CouchDB 2.0, which includes the multi-node clustering technology that Cloudant developed and has been donated back to Apache CouchDB’s open-source community. The process for installing CouchDB 2.0 is a bit more involved and involves building the project from source after installing its dependencies. The single-node worked and so plans were made to build a 12-node cluster. Here is my original sketch:The idea was to have 12 Pis arranged in ring to mimic the logical arrangement of the full-size servers in a real production Cloudant custer. Each machine was to be connected via wifi to router at the rear and a load balancer (a 13th Pi) would direct traffic around the cluster.* The hardware was ordered and boxes were unpacked. The blank SD cards were burned with fresh operating system images* LEDs were hand-soldered to resistors and GPIO connectors.Then came the tricky bit: building and installing the software on 12 devices. The automation tool I chose to help with this was Ansible which allows the scripting of tasks in YML ‘playbooks’ which can be executed in parallel via SSH on multiple host machines. The playbooks I created are published in two Github repositories:* https://github.com/glynnbird/ansible-cluster-tools - a grab-bag of scripts I used to configure the cluster, install services and customise the installation.A key feature of the project was to make each machine’s LED flash whenever that node was performing an action. To do this I created a Node.js script called ‘flasher’ which pulses an LED on and off whenever a line of text arrives on stdin. This allows output from log files to be piped to ‘flasher’ very simply e.g.tail -f node1.log | flasher > /dev/null &This brings me to head-scratching problem that caused me an hour or two of head scratching. It turns out thattail -f node1.log | grep 'FLASH'happily produces output, but not when its output is piped to another process. i.e.# this works - each line appearing in node1.log containing ‘FLASH’# appears on stdouttail -f node1.log | grep 'FLASH'# this doesn’t work - the LED doesn’t flash - the ‘flasher’ script# doesn’t see any input!tail -f node1.log | grep 'FLASH' | flasherWhy not? You have to do:tail -f node1.log | grep --line-buffered 'FLASH' | flasher > /dev/null &otherwise nothing happens. Silly me.I had to patch CouchDB’s “Fabric” Erlang code to ensure that log messages were created containing the word ‘FLASH’ whenever a node was asked to store or retrieve data e.g.all_docs(DbName, Options, #mrargs{keys=undefined} = Args0) ->couch_log:notice(""FLASH all_docs"", []),The effect of this is that when a document is added into the distributed database, three machines’ LEDs flash simultaneously, indicating the three nodes dealing with the shard that the data resides in. 
By sharding the data, Cloudant can store more data than could be held on one machine and divides read, write and indexing load into smaller chunks.When all of the database nodes were configured and a load-balancer running HAproxy was built, the cluster was up and running, shown here testing the flashing of LEDs:After that, the devices were sent away to be turned into something worthy of displaying at a conference booth:So if you see Cloudant represented at a developer conference near you, stop by and say hello and I’ll show you how it all works. See how the cluster shares the workload around the cluster, how it keeps multiple copies of the same data and how it can survive node failures automatically. The cluster’s data can be replicated to other instances of CouchDB, to live Cloudant accounts or to mobile devices running PouchDB or Cloudant Sync for iOS or Android.Buy SD cards with a pre-installed operating system image. Burning your own is very slow.Use “class 10” SD cards. It doesn’t make the Raspberry Pis any faster, but it does make dealing with images on your Mac/PC a good deal quicker.Automate everything. Ansible was invaluable for coordinating actions across all the nodes in parallel.Use a tagged release of CouchDB - if you build the “master” branch, then you will also get unstable “master” versions of its dependencies.Use the new Raspberry Pi 2 model - they are much quicker and cost the same as the older models.","Cloudant’s NoSQL Database-as-a-Service clusters are hidden away in the depths of data centres around the world belonging to SoftLayer, Rackspace, Microsoft and Amazon, so there is little tangible product to display at a conference stand; no software to install, no drivers required, no sql! A working Cloudant cluster at the booth allows distributed databases to be seen in action with flashing lights indicating per-node activity.",Building a Cloudant cluster of Raspberry Pis,Live,257 757,"Homepage Follow Sign in / Sign up 47 3 Oliver Cameron Blocked Unblock Follow Following I lead the self-driving car team at @udacity. Previously founder of a @ycombinator startup. yesterday 2 min read -------------------------------------------------------------------------------- OPEN SOURCING 223GB OF DRIVING DATA COLLECTED IN MOUNTAIN VIEW, CA BY OUR LINCOLN MKZ Data available on GitHubA necessity in building an open source self-driving car is data. Lots and lots of data. We recently open sourced 40GB of driving data to assist the participants of the Udacity Self-Driving Car Challenge #2 , but now we’re going much bigger with a 183GB release. This data is free for anyone to use, anywhere in the world. WHAT’S INCLUDED 223GB of image frames and log data from 70 minutes of driving in Mountain View on two separate days, with one day being sunny, and the other overcast. Here is a sample of the log included in the dataset. Note: Along with an image frame from our cameras, we also include latitude, longitude, gear, brake, throttle, steering angles and speed . Mountain View, CATo download both datasets, please head to our GitHub repo . -------------------------------------------------------------------------------- We can’t wait to see what you do with the data! Please share examples with us in our self-driving car Slack community , participate in Challenge #2 , or send a Tweet to @olivercameron . Enjoy! Self Driving Cars Autonomous Vehicles Open Source Machine Learning Big Data 47 3 Blocked Unblock Follow FollowingOLIVER CAMERON I lead the self-driving car team at @udacity . 
Previously founder of a @ycombinator startup. FollowUDACITY INC Be in Demand × Don’t miss Oliver Cameron’s next story Blocked Unblock Follow Following Oliver Cameron","A necessity in building an open source self-driving car is data. Lots and lots of data. We recently open sourced 40GB of driving data to assist the participants of the Udacity Self-Driving Car Challenge #2, but now we’re going much bigger with a 183GB release. This data is free for anyone to use, anywhere in the world.",Open Sourcing 223GB of Driving Data – Udacity Inc,Live,258 759,"Compose The Compose logo Articles Sign in Free 30-day trialETCD 2 TO 3: NEW APIS AND NEW POSSIBILITIES Published May 11, 2017 etcd etcd 2 to 3: new APIs and new possibilitiesThe change from version 2 to 3 of the distributed etcd database also sees massive changes in how the database works. To help you understand the what and why of the changes, read on... At Compose our engineering teams have been getting deep into etcd version 3.x, the follow-up to etcd 2.x that is currently deployable on Compose. Etcd has become an essential tool behind the scenes of many cloud computing projects and products as it offers a simple, reliable, consistent, key-value database that can be used as the source of truth for huge clusters of cloud-deployed applications and their configuration. A jump in major numbers always means that a lot of things change in any product, usually in response to the requirements of customers and users of the preceding version. In etcd 3.x, this is doubly so as fundamental concepts have been reworked to suit the demands of scale and efficiency and that means there's a new learning curve. FROM HTTP TO GRPC Let's start with a change that touches every point of the system; how applications communicate with etcd. The etcd 2.x system's API was built on JSON communicated HTTP endpoints. This was very accessible; all you needed was curl or similar and you could work with it. This is what is now called the etcd API version 2. It worked for the original scale of etcd but the developers were looking to handling ""tens of thousands of clients and millions of keys in a single cluster"". For that, they have moved over to gRPC which is built on top of Protocol Buffers . It's inspired by HTTP/REST but runs over HTTP/2 , uses static routes only rather than ones with parameters embedded in them and sends back API-centric results rather than HTTP status codes. It also builds in support for full-duplex streaming for long running connections. This is the etcd API version 3. An etcd 2.x server only understands the version 2 API. An etcd 3.x server can understand both version 2 and version 3 APIs but, and it's a huge but, anything you create with clients using one API version will be invisible to clients using the other API version. That's because around the back end, each API routes to a separate data store - they are so different that they are isolated from each other inside the server. ALL CHANGE IN ETCDCTL That split goes all the way up to the command line often your first port of call when working with etcd. Etcdctl , the command-line tool for etcd, is one binary but it now behaves like one of two programs depending on the ETCDCTL_API environment variable. Set it to 2, and it behaves like the etcdctl application from etcdv2 using HTTP/JSON communications and the familiar set of commands. Set it to 3 and pretty much every command is different as the applications works in terms of the newer API. 
To give you an idea, here's a screenshot of both versions of the command side by side. From this point on, when we say etcd2, we're referring to the API version 2 and etcd3 refers to the API version 3. GOODBYE HIERARCHY, HELLO FLAT KEYSPACE One of the interesting attributes of keys in etcd2 is the ability to also hold directories of more keys with values or more directories. This lets you create hierarchical file-system like structures for holding your data, like ""/clusters/node00/activity/xyz"". You could perform various operations with reference to this hierarchy too, so etcd2 allowed clients to wait for activity on a key or a directory (or any of its children) so, for example, you could monitor ""/clusters/node00"" for changes. Well, that's all gone. There's now a simple flat namespace for keys. The switch to flat namespaces makes things much easier to manage in terms of consistency and efficiency in clustered systems which is why most people want something like etcd in the first place. You can create a key that's ""/clusters/node00/activity/xyz"" but it's handled as a single string. There's no directories implied or created. That said, you can create your own hierarchy through how you name things and etcd3 is there with a prefix option to let you match anything that starts with a particular key value. So you can emulate directory structures; for example, given that key above, we could just look for changes for anything in ""node00"" with this command: ETCDCTL_API=3 etcdctl watch --prefix ""/cluster/node00/"" And get a similar effect. Prefixes mitigate the loss of directory structures in etcd3 for the more predictable flat namespace. If you are making extensive use of directory structures in etcd2, this is going to be the first thing you want to allow for in your migration to etcd3. COMPARE AND SWAP OUT, TRANSACTIONS IN In etcd2, much is made of the atomicity of particular options, such as compare-and-swap to ensure that no two clients interfere with each other and leave the data inconsistent. The problem with atomic actions is, though, as things get more complex more data needs to be consistently modified and an atomic action is by definition, limited in scope to protecting the action. Etcd3 still has atomic operations, but they are now joined by the more interesting transactions. These aren't transactions in the traditional ""giant lock"" sense, but a compact guarded ""if ... then ... else"" operation. Here's a small sample of Go code and the clientv3 library using a transaction: tx := cli.Txn(context.TODO()) txresp, err := tx.If( clientv3.Compare(clientv3.Value(""foo""), ""="", ""bar""), ).Then( clientv3.OpPut(""foo"", ""sanfoo""), clientv3.OpPut(""newfoo"", ""newbar""), ).Else( clientv3.OpPut(""foo"", ""bar""), clientv3.OpDelete(""newfoo""), ).Commit() In the If() section, a comparison is defined (checking key foo to see if it's equal to bar ). You can have multiple comparison operators here; the If is true if all the comparisons are true. If that is true, the operations in the Then() section are run. If not, the Else() sections operations are run. You can do multiple operations and all the changes will be handled as a single index increment in etcd's database. It's quite a powerful primitive and it's what you'll use to replace the Compare-and-swap and Compare-and-delete operations in etcd2 code. TTLS EXPIRED, LEASES OBTAINED The change with TTLs in etcd3 sees the per key TTLs of etcd2 turn into a more general Lease. Leases can be created and have keys attached to them. 
The Lease itself has a time to live and when that expires all the keys attached to the Lease get expired. You can keep the Lease alive with a KeepAlive request or make it go away with a Revoke request. What this gives you, practically, is much better-synchronized behavior. A server could create a set of property values with all the keys to those values under one Lease. If it is the server's responsibility to send KeepAlive requests to the Lease, when it stops doing that then all the related properties neatly disappear. Working with it is simple enough too: // Get a lease lease, err := cli.Grant(context.TODO(), 10) // Attach a key to it _, err = cli.Put(context.TODO(), ""foo"", ""bar"", clientv3.WithLease(lease.ID)) ... // Prod it to keep alive once... _, err = cli.KeepAliveOnce(context.TODO(), lease.ID) // Sleep time.Sleep(time.Second*5) // Read the time to live status, err = cli.TimeToLive(context.TODO(), lease.ID) fmt.Printf(""Status: %v\n"", status.TTL) WATCHING RATHER THAN WAITING Watching in etcd2 meant waiting for changes; opening an HTTP connection for each key you wanted to watch and waiting for it to return changes. For etcd3, and in keeping with getting everything to scale better, the way you watch is now handled by watcher RPCs. Create a watcher RPC and request watches on keys or ranges of keys from it and it'll return a stream of changes to those keys. You can ask for previous revisions too, back to when the server last compacted its data, and play back from there. In the Go client for etcd3, the Watcher RPC is managed for you and all you need to do is request a Watch which returns you a Go channel down which the changes arrive. That looks something like this: rch := cli.Watch(context.Background(), ""foo"", clientv3.WithPrefix()) go func(chn clientv3.WatchChan) { for wresp := range chn { for _, ev := range wresp.Events { fmt.Printf(""%s %q : %q\n"", ev.Type, ev.Kv.Key, ev.Kv.Value) } } }(rch) This snippet launches a goroutine which prints out incoming change events. I'm using the prefix option which was mentioned earlier. This uses the key value as the prefix we want to match with so I get changes for ""foo"", ""foo2"", ""foonicular"", ""foo/bar/ftang/ftang"" and whatever other keys start with ""foo"". PREVIOUS VALUES OR NOT Many etcd2 operations could return the previous value associated with a key so you could see what you'd deleted or what you'd replaced. By default, etcd3 doesn't do this. There is a WithPrevKV() option you can add to operations, but don't assume it'll always return anything. To optimize etcdv3, the server compacts the data regularly and if the compacted data isn't available, there's nothing for WithPrevKV() to return. If you can, stop relying on this behavior. If you can't though, an option is to create a transaction which reads the current value and returns it before changing it. It's fiddly, but it'll be atomic and reliable. SO ETCD3? Given all these changes, it is pragmatically worth considering etcd 3.x and the etcd's version 3 API as a new database in terms of developing your client and creating your ops workflows. It is built for efficient scaling up of workloads though and avoids the dangers of simple operations in complex environments with its use of leases and watchers. There's no simple migration path for applications and, currently, there are not as many client drivers for various languages as there are for etcd2. That said, gRPC is widely available and you can consider developing your own driver. 
If you want an enterprise-scaled, consistent, observable source of truth, then etcd 3.x and the etcd version 3 API are the way to go. We've only skimmed over the changes here and not touched any of the new features that have appeared; we'll have more on that when it gets closer to etcd3 being made available on Compose. -------------------------------------------------------------------------------- If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com . We're happy to hear from you. attribution HypnoArt Dj Walker-Morgan is Compose's resident Content Curator, and has been both a developer and writer since Apples came in II flavors and Commodores had Pets. Love this article? Head over to Dj Walker-Morgan ’s author page and keep reading.RELATED ARTICLES May 5, 2017NEWSBITS - ELASTICSEARCH, REDIS, MONGODB, ETCD, GCC, GO, HOMEBREW AND MORE NewBits for the week ending 5th May - Elasticsearch goes to 5.4, Redis history revealed, MongoDB and etcd updates, GCC is 30… Dj Walker-Morgan Apr 28, 2017NEWSBITS - MYSQL, ELASTICSEARCH, MONGODB, ETCD, COCKROACHDB, SQL SERVER, CRICKET AND JUICE NewBits for the week ending 28th April - MySQL 8.0.1's preview demos better replication, Elasticsearch, MongoDB and etcd get… Dj Walker-Morgan Feb 17, 2017NEWSBITS: REDIS, ETCD AND ELASTICSEARCH UPDATES, GO 1.8, GITHUB GUIDES AND CHATOPS AND MORE NewsBits for the week ending 17th February - Redis gets a critical update, etcd's latest release, Elasticsearch gets a bump,… Dj Walker-Morgan Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Customer Stories Compose Webinars Support System Status Support & Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ ScyllaDB MySQL Deployments AWS SoftLayer Google Cloud Services Enterprise Add-ons © 2017 Compose, an IBM Company",The change from version 2 to 3 of the distributed etcd database also sees massive changes in how the database works. Let's understand the what and why of the changes.,etcd 2 to 3: new APIs and new possibilities,Live,259 765,"KDNUGGETS Data Mining, Analytics, Big Data, and Data Science Subscribe to KDnuggets News | Follow | Contact * SOFTWARE * NEWS * Top stories * Opinions * Tutorials * JOBS * Academic * Companies * Courses * Datasets * EDUCATION * Certificates * Meetings * Webinars KDnuggets Home » News » 2016 » Oct » Tutorials, Overviews » MLDB: The Machine Learning Database ( 16:n37 )LATEST NEWS, STORIES * MLDB: The Machine Learning Database Top 10 Data Science Videos on Youtube Data Science + Criminal Justice Deep Learning meets Deep Deployment Equifax: Strategic Data Performance Analyst More News & Stories | Top Stories MLDB: THE MACHINE LEARNING DATABASE Previous post Tweet Tags: Classification , Database , Machine Learning , TensorFlow , Transfer Learning -------------------------------------------------------------------------------- MLDB is an open­source database designed for machine learning. Send it commands over a RESTful API to store data, explore it using SQL, then train machine learning models and expose them as APIs. By François Maillet, MLDB.ai . In this post, we’ll show how easy it is to use MLDB to build your own real­time image classification service. We will use different brand of cars in this example, but you can adapt what we show to train a model on any image dataset you want. 
We will be using a TensorFlow deep convolutional neural network, transfer learning, and everything will run off MLDB. TRANSFER LEARNING WITH THE INCEPTION MODEL At a high level, transfer learning allows us to take a model that was trained on one task and use its learned knowledge on another task. We use the Inception-­v3 model , a deep convolutional neural network, that was trained on the ImageNet Large Visual Recognition Challenge dataset. The task of that challenge was to classify images into a varied set of 1000 classes, like badger, freight car or cheeseburger . The Inception model was openly released as a trained TensorFlow graph. TensorFlow is a deep learning library that Google open­-sourced last year, and MLDB has a built-­in integration for it. As you’ll see, MLDB makes it extremely simple to run TensorFlow models directly in SQL. When solving any machine learning problem, one critical step is picking and designing feature extractors. They are used to take the thing we want to classify, be it an image, a song or a news article, and transform it into a numerical representation, called a feature vector, that can be given to a classifier. Traditionally, the selection of feature extractors was done by hand. One of the really exciting things about deep neural networks is that they can learn feature extractors themselves. Below is the architecture of the Inception model, where images go in from the left and predictions come out to the right. The very last layer will be of size 1000 and give a probability for each of the classes. However, the layers that come before are transformations over the raw image learned by the network because they were the most useful to solve the image classification task. Some layers are for example edge detectors. So the idea will be to run images through the network, but instead of getting the output of the last layer, that is specialised to the ImageNet task, getting the second to last, which will give us a conceptual numerical representation of the images. We can then use that representation as features that we can give to a new classifier that we will train on our own task. So you can think of the Inception model as a way to get from an image to a feature vector over which a new classifier can efficiently operate. We are leveraging hundreds of hours of GPU compute-­time that went into training the Inception model, but applying it to a completely new task. INCEPTION ON MLDB Let’s get started! The code below uses our pymldb library. You can read more about it on the MLDB Documentation . What did we do here? We made a simple PUT call using pymldb to create the ​ inception function, of type tensorflow.graph . It is parameterized using a JSON blob. The function loads a trained instance of the Inception model (note that MLDB can transparently load remote resources, as well as files inside of compressed archives; more on this here ). We specify that the input to the model will be the remote resource located at url , and the output will be the ​ pool_3 layer of the model, which is the second to last layer. Using the pool_3 layer will give us high level features, while the last layer called softmax is the one that is specialized to the ImageNet task. Now that the ​ inception function is created, it is available in SQL and as a REST endpoint. We can then run an image through the network with a simple SQL query. Here we’ll run Inception on the KDNuggets logo, and what we’ll get is the numerical representation of that image. 
Those 2048 numbers are what we can use as our feature vector: PREPARING A TRAINING DATASET WITH SQL Now we can import our data for training. We have a CSV file containing about 200 links to car images from 3 popular brands: Audi, BMW and Tesla. It’s important to remember that although we are using a car dataset, you could replace it with your own images of anything you want. We can import the CSV file in a dataset by running an ​ import.text procedure : We can generate some quick stats with SQL: We can now use a procedure of type transform to apply the ​ Inception model over all images and store the results in another dataset. A transform procedure simply executes an SQL query and saves the result in a new dataset. Running the code below is essentially doing feature extraction over our image dataset. TRAINING A SPECIALIZED MODEL Now that we have features for all of our images, we use a procedure of type classifier.experiment to train and test a random forest classifier. The dataset will be split 50/50 between train and test by default. Notice the contents of the ​ inputData key, that specifies what data to use for training and testing, is SQL. The {* EXCLUDING(label)} is a good example of MLDB’s row expression syntax that is meant to work with sparse datasets with millions of columns. Looking at the performance on the test set, this model is doing a pretty good job: DOING REAL­TIME PREDICTIONS Now that we have a trained model, how do we use it to score new images? There are two things we need to do for this: extract the features from the image and then run that in our newly trained classifier. This is essentially our scoring pipeline. What we do is create a function called brand_predictor of type sql.expression . This allows us to persist an SQL expression as a function that we can then call many times. When we trained our classifier above, the training procedure created a car_brand_cls_scorer_0 automatically, available in the usual SQL/Rest, that will run the model. It will be expecting an input column named ​ features . And just like that we’re now ready to score new images off the internet: { ""output"": { ""scores"": [ [ ""\""audi\"""", [ -8, ""2016-05-05T04:18:03Z"" ] ], [ ""\""bmw\"""", [ -7.333333492279053, ""2016-05-05T04:18:03Z"" ] ], [ ""\""tesla\"""", [ 0.2666666805744171, ""2016-05-05T04:18:03Z"" ] ] ] } } The image we gave it represented a Tesla, and that is the label that got the highest score. CONCLUSION The Machine Learning Database solves machine learning problems end­-to-­end, from data collection to production deployment, and offers world­-class performance yielding potentially dramatic increases in ROI when compared to other machine learning platforms. In this post, we only scratched the surface of what you can do with MLDB. We have a white-­paper that goes over all of our design decisions in details. If we’ve peaked your interest, here are a few links that may interest you: * try MLDB for free in 5 minutes by launching a hosted instance run a trial version of MLDB on your own hardware using Docker or Virtualbox Check out our demos and tutorials , especially DeepTeach which uses the same techniques as shown in this post, and MLPaint , a white­box real­time handwritten digit recogniser MLDB Github repository Don’t hesitate to get in touch! You can find us on Gitter , or follow us on Twitter . All the code from this article is available in the MLDB repository as a Jupyter notebook , and is also shipped with MLDB. 
Boot up an instance, go the the demos folder and you can run a live version. Happy MLDBing! Bio: François Maillet is a computer scientist specialising in machine learning and data science. He leads the machine learning team at MLDB.ai, a Montréal startup building the Machine Learning Database ​ (MLDB). François has been applying machine learning for almost 10 years to solve varied problems, like real­-time bidding algorithms and behavioral modelling for the adtech industry, automatic bully detection on web forums, audio similarity and fingerprinting, steerable music recommendation and playlist generation. Related: * Recycling Deep Learning Models with Transfer Learning Spark for Scale: Machine Learning for Big Data The Deception of Supervised Learning -------------------------------------------------------------------------------- Previous post -------------------------------------------------------------------------------- TOP STORIES PAST 30 DAYS Most Popular 1. The 10 Algorithms Machine Learning Engineers Need to Know 21 Must-Know Data Science Interview Questions and Answers How to Become a Data Scientist - Part 1 7 Steps to Mastering Machine Learning With Python Top Algorithms and Methods Used by Data Scientists 9 Key Deep Learning Papers, Explained 7 Steps to Mastering Apache Spark 2.0 Most Shared 1. Top Algorithms and Methods Used by Data Scientists Data Science for Internet of Things (IoT) : Ten Differences From Traditional Data Science 7 Steps to Mastering Apache Spark 2.0 Battle of the Data Science Venn Diagrams Top Data Scientist Claudia Perlich on Biggest Issues in Data Science Data Science Basics: Data Mining vs. Statistics Automated Data Science & Machine Learning: An Interview with the Auto-sklearn Team MORE RECENT STORIES * Equifax: Senior Statistical Modeler Equifax: Senior Director, Search-Match & Data-Linking Rexer Analytics Data Science Survey Highlights Equifax: Metadata Expert Artificial Intelligence, Deep Learning, and Neural Networks, E... Strata Hadoop 2016: Fast Data and Robots NYU Stern – Master of Science in Business Analytics K2 Data Science Bootcamp Data Preparation Tips, Tricks, and Tools: An Interview with th... EDISON Data Science Framework to define the Data Science Profe... Novel Tensor Mining Tool to Enable Automated Modeling Equifax: Employee Analytics Leader Equifax: Data Visualization Engineer Equifax: Data Strategy Leader How to Get Stuff Done at a Data Startup Apache: Big Data Europe (Nov. 14-16) – Leading Event for... The R Graph Gallery Data Visualization Collection Zaireo: Data Scientist Top tweets, Oct 05-11: Most Active #DataScientists on #Gith... Top 12 Interesting Careers to Explore in Big Data KDnuggets Home » News » 2016 » Oct » Tutorials, Overviews » MLDB: The Machine Learning Database ( 16:n37 ) © 2016 KDnuggets. About KDnuggets Subscribe to KDnuggets News | Follow @kdnuggets | | X","MLDB is an open­source database designed for machine learning. Send it commands over a RESTful API to store data, explore it using SQL, then train machine learning models and expose them as APIs.",The Machine Learning Database,Live,260 766,This video shows you how to replicate one of the sample databases on cloudant.com to your Cloudant account. Sign up for a Cloudant account here: https://cloudant.com/sign-up/. Find more videos and tutorials in the Cloudant Learning Center: http://www.cloudant.com/learning-center,This video shows you how to replicate one of the sample databases on cloudant.com to your Cloudant account. 
,Replicate a Sample Database,Live,261 767,"Skip to main content IBM developerWorks / Developer Centers Sign In | Register Cloud Data Services * Services * How-Tos * Blog * Events * ConnectCHECK OUT IBM’S “NEW BUILDERS” PODCASTMike Broberg / April 28, 2016We here on the CDS developer advocacy team sure like to code. But we also liketo talk. Whether we’re presenting at conferences, leading sessions athackathons, or recording demos of our apps — it’s a rare moment we’re notspreading the good word.Now, we have a new outlet for our motor mouths: The New Builders Podcast !Episodes at https://developer.ibm.com/tv/builders/The New Builders is a weekly podcast featuring developers from around the web,talking about the new languages, libraries, and infrastructure they’re using tobuild their apps. It’s a mix of perspectives from engineers outside and insideof IBM. Here’s a recap of the first three episodes: * The first episode features our own Bradley Holt in a roundtable discussion on web/mobile development, where he advocates for offline first design. On the other side of the conversation is Greg Avola , the CTO for social beer app Untappd . They talk about progressive web apps, HTML5, Ionic, PouchDB, and more. While Untappd doesn’t persist data locally for offline access, the app makes heavy use of cross-platform development with Apache Cordova™. * The second episode features leaders from private messaging app Cyber Dust : CEO & Co-Founder Ryan Ozonian and Lead Engineer Rohit Kotian . They talk about about scaling their stack, which for their core messaging platform is a lot of Java and GridGain. Their users’ messages are held in-memory and never persisted anywhere. Take comfort in the ephemeralness. * The third episode is a discussion with our own David Taieb and IBM Lead Data Scientist Jorge Castañón . Jorge and David built an analytics app for predicting flight delays at airports. But they faced a big challenge in connecting to on-premises data sources and moving data to the cloud, where it could be more efficiently analyzed. Listen in for their approach to data movement and machine learning in Apache Spark™.When I first started working with Cloudant in 2012, I learned the most fromtalking directly to engineers and customers. Often the output was mediainterviews or Q&A articles I’d post to our blog. My marketing colleagues Doug Flora and Jim Young , who are producing this podcast, are taking a similar approach, but doing waybetter than I ever did. I’ve really enjoyed the podcast so far. The New Buildersis worth a listen.The first episode is embedded here below. Enjoy!SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. 
Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Geospatial * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM PrivacyIBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.","The New Builders podcast features weekly interviews with developers from around the web, discussing code, infrastructure, and their overall stack.",IBM's New Builders podcast,Live,262 768,"* R Views * About this Blog * Contributors * Some Resources * * R Views * About this Blog * Contributors * Some Resources * DECEMBER ’16 RSTUDIO TIPS AND TRICKS by Sean Lopp Here is this month’s collection of RStudio Tips and Tricks. Thank you to those who responded to last month’s post ; many of your tips are included below! Be sure to subscribe to @rstudiotips on Twitter for more. This month’s tips fall into two categories: Keyboard Shortcuts and Easier R Markdown KEYBOARD SHORTCUTS The RStudio IDE is built upon “hooks”. Hooks are actions that the IDE can take. For instance, there is a hook to create a new file. Most users interact with hooks with point-and-click interactions. ( RStudio toolbar -> new file or File -> New File ). But, there is an alternative! All of these hooks have been surfaced to end users and can be bound to a keyboard shortcut. (Some of these actions are “secret” – they aren’t exposed through point-and-click options.) CUSTOM KEYBOARD SHORTCUTS To view the complete list of actions, the current keybindings, and to customize keybindings, go to: Tools -> Modify Keyboard Shortcuts . CODE CHUNK NAVIGATION Define shortcuts for code chunk navigation using the previous tip. For example, Alt+Cmd+Down for Next Chunk and Alt+Cmd+Up for Previous Chunk. ASSIGNMENT OPERATOR Use Alt+- (press Alt at the same time as pressing - ). This adds the assignment operator and spacing. PIPE OPERATOR Use Cmd+Shift+m (for Mac) or Ctrl+Shift+m (for Windows). This adds the pipe operator %>% and spacing. EASIER R MARKDOWN R MARKDOWN OPTIONS R Markdown output formats include arguments specified in the YAML header. Don’t worry about remembering all of the key-value pairs; in RStudio, you can access and change the most common through a user-interface: SPELL CHECKER Use the built-in spell checker when writing a R Markdown document. (Code chunks are automatically ignored.) SQL CODE CHUNKS Execute SQL queries against database connections directly in R Markdown chunks. R MARKDOWN WEBSITES Are you building a website with R Markdown ? Any RStudio project with an R Markdown website will include a Build Website option in the build pane. What’s your favorite RStudio Tip? seanlopp 2016-12-08T17:53:20+00:00 250 Northern Ave, Boston, MA 02210 844-448-1212 info@rstudio.com DMCA Trademark Support ECCN * Switch tabs w/o muscle cramps: New RStudio Desktop 1.0.136 switches w/ Ctrl+Tab. Lots of tabs? Ctrl+Shift+. to select tab by name! #rstats 6 days ago Copyright 2016 RStudio | All Rights Reserved | Legal Terms Twitter Linkedin Facebook Rss Email github Rss",Here is this month’s collection of RStudio Tips and Tricks. Thank you to those who responded to last month’s post; many of your tips are included below! 
Be sure to subscribe to @rstudiotips on Twitter for more.This month’s tips fall into two categories: Keyboard Shortcuts and Easier R MarkdownKeyboard ShortcutsThe RStudio,December '16 RStudio Tips and Tricks,Live,263 771,"Skip to main content IBM developerWorks / Developer Centers Sign In | Register Cloud Data Services * Services * How-Tos * Blog * Events * ConnectINTRODUCING SPARK-CLOUDANT, AN OPEN SOURCE SPARK CONNECTOR FOR CLOUDANT DATAmikebreslin / March 9, 2016We would like to introduce you to the spark-cloudant connector, allowing you touse Spark to conduct advanced analytics on your Cloudant data. Thespark-cloudant connector can be found on GitHub or the Spark Packages site and is available for all to use under the Apache 2.0 License . As with most things Spark, it’s available for Python and Scala applications.If you haven’t heard of Apache Spark™, it is the new cool kid on the block inthe analytics space. Spark is touted as being an order of magnitude faster andmuch easier to use than its analytic predecessors, and its popularity hasskyrocketed in the past couple of years. If you would like to learn more aboutSpark in general, I recommend checking out the Spark Fundamentals classes on Big Data University and the great tutorials on IBM developerWorks .Flexible JSON database plus in-memory analytics, ftw!START FAST WITH SPARK ON BLUEMIXSo how do you get going quickly in analyzing your Cloudant data in Spark?Luckily, IBM has a fully-managed Spark-aaS offering in IBM Bluemix that has the latest version of the spark-cloudant connectoralready loaded for you. Head on over to the Bluemix catalog to sign-up and create a Spark instance to get started. Since the spark-cloudantconnector is open source, you are also free to use it in your own stand-aloneSpark deployments with Cloudant or Apache CouchDB™. Next, check out the README on GitHub, the Bluemix docs on Spark-aaS , and the great video tutorials on the Learning Center showing how to use the connector in both a Scala and Python notebook.The integration with Spark opens the door to a number of new analytical usecases for Cloudant data. You can load whole databases into a Spark cluster foranalysis. Alternatively you can read from a Cloudant secondary index (a.k.a.“MapReduce view”) to pull a filtered subset or cleansed version of your CloudantJSON. Once you have the data in Spark, use SparkSQL for full adhoc queryingcapabilities in familiar SQL syntax. Spark can efficiently transform or filteryour data and write it back into Cloudant or another data source. Because Sparkhas a variety of connection capabilities, you can also use it to conductfederated analytics over disparate data sources such as Cloudant, dashDB andObject Storage.EXAMPLE: CLOUDANT ANALYTICS WITH SPARKTo provide another example of using the spark-cloudant connector, check out this example Python Notebook on GitHub and load it into your Spark service running on Bluemix. (It becomesinteractive once you upload it to a Spark notebook using the instructionsbelow.) This notebook does the following: * Loads a Cloudant database spark_sales from Cloudant’s examples account containing documents with sales rep, month, and amount fields.(Feel free to replicate the https://examples.cloudant.com/spark_sales database into your own Cloudant account and update the connection details if you prefer.) * Detects and prints the schema found in the JSON documents. * Counts the number of documents in the database. 
* Prints out a subset of the data and shows how to print out a specific field in the data. * Uses SparkSQL to perform counts, sums, and order by value queries on the data. * Prints a graph of the monthly sales. * Filters the data based on a specific sales rep and month. * Counts and shows the filtered data. * Saves the filtered data as documents into a Cloudant database in your own account.(You need to create the database in your Cloudant account and enter credentials for your account in the notebook before this final step will work.) Notes for new Bluemix users: 1. After provisioning the IBM Analytics for Apache Spark service, click on its service tile in the Bluemix dashboard and open the UI to manage Spark instances. 2. Create a new instance (if needed) and a new notebook within that instance. 3. On the Create Notebook page, choose “From URL” and use the URL for the raw IPython notebook data, which should look like https://raw.githubusercontent.com/cloudant-labs/spark-cloudant/master/examples/ipython/python_Cloudant2.ipynb 4. Run the code block-by-block using the triangular play button in the menu bar, but be sure to read the code comments before running block 10 and modify the snippet accordingly.We hope you find the Spark integration a powerful tool to conduct analytics onyour Cloudant data. If you have any feedback or encounter an issue with thespark-cloudant connector, please open an issue in GitHub.--------------------------------------------------------------------------------© “Apache”, “CouchDB”, “Spark”, “Apache CouchDB”, “Apache Spark”, and the Sparklogo are trademarks or registered trademarks of The Apache Software Foundation.All other brands and trademarks are the property of their respective owners.SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM PrivacyIBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.","Meet the new spark-cloudant connector, for adding powerful analytics to your Cloudant JSON. I also include a simple example that shows how to use SparkSQL to order Cloudant data by value.","Introducing spark-cloudant, an open source Spark connector for Cloudant data",Live,264 772,"Enterprise Pricing Articles Sign in Free 30-Day TrialREDIS CONFIGURATION CONTROLS - NEW AT COMPOSE Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published Jun 2, 2016At Compose, we're all about giving you control of your databases where we can and Redis users on Compose are about to get a whole lot more control. It's a story about iterating design. 
The team at Compose who work on Redis looked at their recently introduced Slow Query Logs feature and decided they could make it better. In the process they created the new Redis Configuration Controls. The Redis Configuration Controls allows experienced users to change a selection of Redis settings so they can tune their deployments to behave exactly as they want them. These aren't for new users to modify without thoroughly researching the consequences and taking care in the process; please consult the linked documentation for each setting before changing it. The design of our user interface follows the actual redis.conf configuration file, turning it into an interactive form. This means it'll be easier to take application recommended configurations and apply them to your own Redis deployment. Under the hood, we handle these new settings by automatically applying them to both nodes of your high availability deployment. There's no need to wait for synchronisation at the Redis level to ensure they are applied and saved. Over the coming months, we plan on refining the experience of Configuration Controls. USING THE CONFIGURATION CONTROLS We'll briefly introduce each setting in this article, but for full details of them, you'll find links to the Redis documentation for many of them by clicking on the blue circled question mark next to its name. As we mentioned, the sections and fields of the interface are modelled on the redis.conf configuration file and that means, where there isn't a link to the documentation, you can find more information there. Now, onwards to the Configuration Controls themselves. You'll find the them in the Compose console for Redis under the Settings tab. At the top of the Settings view, as before, are the version and upgrade controls, then the ""Redis as a Cache"" control and then, below that are the new Configuration Controls. They open with the warning we've just given, that these are expert Redis settings with a link to the example Redis configuration file. Then we're into the various settings groups and settings. Any changes made in these settings will only be applied when the Apply Configuration Changes button at the bottom is pressed. This will put any changes you have made to the configuration into practice on each of the servers. NETWORK The first group of settings concern the Network . This contains timeout and tcp-keepalive . TIMEOUT When it's set to 0, the timeout setting's default, idle client connections stay open until they are closed by the client. You may want to ensure idle clients are ejected after some number of seconds and setting this to a non-zero value will set that number of seconds. TCP-KEEPALIVE While some parts of the network will also step in to disconnect idle connections, use of a keepalive will send TCP ACKs at regular intervals to keep the connection open. That interval can be set here. Setting it to 0, which is the default, disables this feature. SECURITY This section is slightly different because its requirepass setting is set outside the Configuration Controls. REQUIREPASS The Redis authentication credential is a simple password and this is where it can be set. Clicking on Change will send you to the Overview page where that credential can safely be changed. Be aware that any other settings you may have made in the Configuration Controls will be discarded when you click Change . LIMITS MAXMEMORY-POLICY This setting replaces the old ""Change Maxmemory Policy"" control by letting you directly set the policy. 
The Redis documentation on eviction policies covers what the available settings - no eviction, LRU, volatile LRU, random, volatile random and volatile TTL - do. If you continue on reading that page you'll see there's a setting you can use to fine tune some of those policies which is... MAXMEMORY-SAMPLES This setting lets the user control how the sampling-LRU mechanism works in Redis by setting the number of samples used. It defaults to 5. LUA SCRIPTING LUA-TIME-LIMIT There's only one setting in Lua Scripting and it sets the lua-time-limit . That's the number of milliseconds that a Lua script can run before being kicked into touch by Redis for taking too long. It's a safety feature to stop the system being hogged by badly written loopy scripts. Important fact: This doesn't kill the script, it logs it and tells other clients the system is busy while waiting to be told to kill the script. The default is five seconds which is enormous when you consider a script is supposed to run in a millisecond. SLOW LOGS The Slow Query Log feature is where we began. It uses two configuration settings, slowlog-log-slower-than and slowlog-max-len . A brief reminder – read more in our slow log introduction . SLOWLOG-MAX-LEN The slow log is actually a queue of slow log events and you can control the size of that queue with slowlog-max-len . The bigger you make it, the more memory you will consume. Ideally, it should be big enough for you to catch your problematic slow commands, but not so big that it becomes an issue itself. The default is 128 and we recommend you run with that till you are certain you need to expand it. SLOWLOG-LOG-SLOWER-THAN The other way to capture that tricky slow event would be to filter out all the slow, but not that slow, log events . That's where slowlog-log-slower-than comes in. It sets the threshold on what qualifies as a slow event. It defaults to 10000 microseconds. The slow query log viewer has moved to the main tab bar of the console as part of the switch to the Configuration Control to make it more accessible and it retains it's own Settings dialog so you can quickly and safely adjust just the two values that matter. EVENT NOTIFICATION NOTIFY-KEYSPACE-EVENTS Another setting that was previously available is the Event Notification 's notify-keyspace-events . This setting gives you the ability to plug into the changes going inside the database. The feature is called ""Keyspace Notifications"" and you can read about it in the Redis documentation . The short version of that is you set a configuration variable, notify-keyspace-events , to a string which represents what events you want to hear about. Setting it to the string ""KEA"" says you want a stream of all events. You can listen to that stream by connecting to Redis and issuing a psubscribe command looking for messages with a key pattern of __key*__:* . WRAPPING UP That's it for the current settings available in the Redis Configuration Controls. Keep an eye on Compose Notes for updates on new settings being made available. Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Dj Walker-Morgan is Compose's resident Content Curator, and has been both a developer and writer since Apples came in II flavors and Commodores had Pets. Love this article? Head over to Dj Walker-Morgan’s author page and keep reading. 
Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Database as a Service Support System Status Support Enhanced Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ Enterprise Deployments AWS DigitalOcean SoftLayer© 2016 Compose",The team at Compose who work on Redis looked at their recently introduced Slow Query Logs feature and decided they could make it better. In the process they created the new Redis Configuration Controls.,Redis Configuration Controls,Live,265 774,"Homepage Follow Sign in / Sign up Homepage * Home * Data Science Experience * * Watson Data Platform * IBM Data Science Experience Blocked Unblock Follow Following May 4 -------------------------------------------------------------------------------- DEVELOPING IBM STREAMS APPLICATIONS WITH THE PYTHON API (VERSION 1.6) The IBM Data Science Experience (DSX) platform now integrates Streaming Analytics services using version 1.6 of the Python Application API, which enables application development and monitoring entirely in Python. The currently supported Python version is Python 3.5. Python developers can use the streamsx package to: * Create IBM Streams applications in DSX Jupyter notebooks. * Create apps that are run in the Streaming Analytics service. * Access data streams from views defined in any app that is running on the service. Furthermore, Python developers can now monitor submitted jobs with the Python REST API. This is particularly interesting for developers who want to retrieve and visualize streaming data in Jupyter notebooks , for example, for debugging or extra logging. To develop streaming applications with Python 3.5 in DSX Jupyter notebooks, you can use the STREAMING_ANALYTICS_SERVICE context to submit a Python application to the IBM Streaming Analytics service. Sample DSX Jupyter notebooks for Python applications that process streams are available on the community page of DSX: * Hello World! : Create a simple Hello World! application to get started and deploy this application to the Streaming Analytics service. * Healthcare Demo : Create an application that ingests and analyzes streaming data from a feed, and then visualizes the data in the notebook. You finally submit this application to the Streaming Analytics service. * Neural Net Demo : Create a sample data set, create a model for the sample data, use that model in a streaming application, visualize the streaming data, and finally submit the streaming application to the Streaming Analytics service. EXAMPLE: THE NEURAL NET NOTEBOOK To illustrate the workflow of building a streaming application in DSX, we can walk through the Neural Net demo listed above. The workflow is comprised of three essential steps: 1. Use the Python API to compose the streaming application. 2. Submit the application to be run in a Streaming Analytics service. 3. Retrieve data back into the notebook for visualization. The purpose of the Neural Net notebook is to demonstrate how a data scientist can train a model on a set of data, and then immediately incorporate that model into a Streaming Application. 
CREATING A SAMPLE DATA SET First, we create a sample data set comparing the temperature of an engine to the probability that it will fail within the next hour: xvalues = np.linspace(20,100, 100) yvalues = np.array([((np.cos((x-50)/100)*100 + np.sin(x/100)*100 + np.random.normal(0, 13, 1)[0])/150.0 for x in xvalues]) yvalues = [y - np.amin(yvalues) for y in yvalues] create_plot(xvalues, yvalues, title=""Engine Temp Vs. Probability of Failure"", xlabel = ""Probability of Failure"", ylabel = ""Engine Temp in Degrees Celcius"", xlim = (20,100), ylim = (0,1)) For brevity, several imports and function definitions were removed, however the full code is shown in the notebook itself . TRAINING A MODEL Given the data set we created, we use the PyBrain library to train a Feed Forward Neural Network (FFN) as a model to predict failure probabilities given a temperature. # The neural net to be trained net = buildNetwork(1,100,100,100,1, bias = True, hiddenclass = SigmoidLayer, outclass = LinearLayer) # Construct a data set of the training data ds = SupervisedDataSet(1, 1) for x, y in zip(xvalues, yvalues): ds.addSample((x,), (y,)) # The training harness. Used to train the model. trainer = BackpropTrainer(net, ds, learningrate = 0.0001, momentum=0, verbose = False, batchlearning=False) # Train the model. for i in range(50): trainer.train() # Display the model in the plot. fig, ax = create_plot(xvalues, yvalues, title=""Engine Temp Vs. Probability of Failure"", xlabel = ""Probability of Failure"", ylabel = ""Engine Temp in Degrees Celcius"", xlim = (20,100), ylim = (0,1)) ax.plot(xvalues, [net.activate([x]) for x in xvalues], linewidth = 2, color = 'blue', label = 'NN output') The fully trained model, net , is a simple Python object, which, when provided with a temperature value, produces a probability reading. Above, we can see the output of the model (in blue) plotted against the data set. USING THE MODEL IN A STREAMING APPLICATION It isn’t enough to simply have the net model in the DSX notebook, we might want to send it into production to predict failures in real time. To insert the model into a real-time streaming application with the streamsx.topology Python API, you must use classes that create and manipulate streaming data. The following two classes represent such creation and manipulation of data, and are necessary components of the streaming application. The periodicSource class submits a random number between 20 and 100 every 0.1 seconds, and is used to simulate sample temperature readings. The NeuralNetModel class simply takes a data item, feeds it as input to the neural net, and returns the output onto a stream. # The source of our data. Every 0.1 seconds, a number between 20-100 will be inserted into the stream # INPUT: None # OUTPUT: A float with range [20,100] class PeriodicSource(object): def __call__(self): while True: time.sleep(0.1) yield random.uniform(20,100) # A class which runs the neural net on data it is passed. # INPUT: the input to the neural net, in this case a floating point number # OUTPUT: an array containing the output of the neural net, as well as the input to the neural net. class NeuralNetModel(object): def __init__(self, net): self.net = net def __call__(self, num): return [num, self.net.activate([num])[0]] BUILDING THE STREAMING APPLICATION The Application uses the periodicSource class to generate a stream temperature readings, which are then processed by an instance of the NeuralNetModel class to create a stream of probability readings. 
Since we are interested in viewing these probability readings, we allow the stream to be viewable with the view() method. # Define operator periodic_src = periodicSource() nnm = NeuralNetModel(net) # Build Graph top = topology.Topology(""myTop"") stream = top.source(periodic_src) # Run the temp readings through the neural net and mark the # output as viewable. view = stream.transform(nnm).view() Now that we have defined the application, we submit it to be run on a Streaming Analytics service on Bluemix using a call to submit . vs={'streaming-analytics': [{'name': service_name, 'credentials': json.loads (credentials)}]} cfg = {context.ConfigParams.VCAP_SERVICES : vs, context.ConfigParams.SERVICE_NAME : service_name} job = context.submit(context.ContextTypes.STREAMING_ANALYTICS_SERVICE, top, config=cfg) You’ll notice that the credentials and service_name values are used to define a cfg object used for authentication. Both of these can be obtained from the Streaming Analytics service management page on Bluemix. VIEWING STREAMING DATA Once the call to submit has completed successfully, the application is running. We can view its output in DSX using the view object that was created earlier. fig, ax = create_plot([], [], title=""Engine Temp Vs. Probability of Failure"", xlabel = ""Probability of Failure"", ylabel = ""Engine Temp in Degrees Celcius"", xlim = (20,100), ylim = (0,1)) xdata = [] ydata = [] try: queue = view.start_data_fetch() for line in iter(queue.get, None): xdata.append(line[0]) ydata.append(float(line[1])) ax.lines[0].set_xdata(xdata) ax.lines[0].set_ydata(ydata) fig.canvas.draw() except: raise finally: view.stop_data_fetch() Each dot in the above graph represents a live temperature reading used to predict likelihood of failure. Every time a new temperature reading is sent through the model, its output is reflected in the graph. CLOSING THE LOOP ON DSX Data visualization is becoming an increasingly important part of data science. After creating a model, a data scientist needs immediate visual feedback on its effectiveness both in and out of a production environment. Whether with static or real-time data, DSX is a tool that helps developers achieve this. -------------------------------------------------------------------------------- Originally published at datascience.ibm.com on May 4, 2017 by William Marshall. * Machine Learning One clap, two clap, three clap, forty?By clapping more or less, you can signal to us which stories really stand out. Blocked Unblock Follow FollowingIBM DATA SCIENCE EXPERIENCE FollowIBM WATSON DATA PLATFORM Build smarter applications and quickly visualize, share, and gain insights * * * * Never miss a story from IBM Watson Data Platform , when you sign up for Medium. Learn more Never miss a story from IBM Watson Data Platform Get updates Get updates","The IBM Data Science Experience (DSX) platform now integrates Streaming Analytics services using version 1.6 of the Python Application API, which enables application development and monitoring…",Developing IBM Streams applications with the Python API (Version 1.6),Live,266 780,"Skip to main content IBM developerWorks / Developer Centers Sign In | Register Cloud Data Services * Services * How-Tos * Blog * Events * ConnectSENTIMENT ANALYSIS OF REDDIT AMASChetna Warade / March 10, 2016Reddit recently announced a coffee-table book, Ask Me Anything: Volume One . 
It’s a collection of their favorite Ask Me Anything (AMA) web events, in which anyone can get online with luminaries like BillGates, Madonna, Chris Rock, Elon Musk, or President Obama and ask them anyquestion that comes to mind.While this is a gorgeous book, it’s missing one key element that makes AMA’s sovaluable and rich: actionable data. When an AMA is online, you can access andanalyze the text to glean insights from the discussion. The possibilities forinteresting analyses are endless. For instance, check out this interactive graph that measures how people use language on reddit. Search for a term to seetrends.The book organizes AMAs in categories like Inspiring , Informative , Provocative , Fascinating , Beautiful , Courageous , Humorous , and Ingenious . Which category would you land in? We wondered the same thing about ourselves.In the spirit of eating our own dogfood (in every sense), we’ll explore thisquestion using an AMA hosted by IBM developers and our home-grown analysistools. Watson Tone Analyzer helps you understand how you’re coming across toothers, so it’s perfect for this job.Here’s how we built our own reddit AMA sentiment analysis solution (and you cantoo). In this tutorial, we: 1. Take an IBM-hosted AMA . 2. Load its data with our handy Simple Data Pipe , which leverages Bluemix (IBM’s Cloud platform service) and runs Node.js to move JSON data from reddit (or another source), enriches the data with Watson Tone Analyzer, and lands results in Cloudant. 3. Run commands in an iPython notebook to analyze the Cloudant JSON output, using Apache Spark to analyze the Watson Tone Analyzer-enriched data to gauge positive or negative emotions measured across multiple tone dimensions, like anger, joy, openness, and more.The Spark-Cloudant Connector is the special sauce that makes this solution work.It lets you connect your Apache Spark instance to a Cloudant NoSQL database andanalyze the data.DEPLOY SIMPLE DATA PIPEThe fastest way to deploy this app to Bluemix is to click the Deploy to Bluemix button, which automatically provisions and binds the Cloudant service too.If you would rather deploy manually , or have any issues, refer to the readme .When deployment is done, click the EDIT CODE button.INSTALL REDDIT CONNECTORSince we’re importing data from reddit, you need to establish a connectionbetween reddit and Simple Data Pipe.Note: If you have a local copy of Simple Data Pipe, you can install this connector using Cloud Foundry . 1. In Bluemix, at the deployment succeeded screen, click the EDIT CODE button. 2. Click the package.json file to open it. 3. Edit the package.json file to add the following line to the dependencies list: ""simple-data-pipe-connector-reddit"": ""^0.1.2"" Tip: be sure to end the line above with a comma and follow proper JSON syntax. 4. From the menu, choose File Save . 5. Press the Deploy app button and wait for the app to deploy again. ADD SERVICES IN BLUEMIXTo work its magic, the reddit connector needs help from a couple of additionalservices. In Bluemix, we’re going analyze our data using the Apache Spark andWatson Tone Analyzer services. So add them now by following these steps:PROVISION IBM ANALYTICS FOR APACHE SPARK SERVICE 1. Login to Bluemix (or sign up for a free trial) . 2. On your Bluemix dashboard, click Work with Data . Click New Service . Find and click Apache Spark then click Choose Apache Spark Click Create .PROVISION WATSON TONE ANALYZER SERVICE 1. In Bluemix, go to the top menu, and click Catalog . 2. 
In the Search box, type Tone Analyzer , then click the Tone Analyzer tile. 3. Under app , click the arrow and choose your new Simple Data Pipe application. Doing so binds the service to your new app. 4. In Service name enter only tone analyzer (delete any extra characters) 5. Click Create . 6. If you’re prompted to restage your app, do so by clicking Restage .LOAD THE REDDIT AMA DATA 1. Launch simple data pipe in one of the following ways: * If you just restaged, click the URL for your simple data pipe app. * Or, in Bluemix, go to the top menu and click Dashboard , then on your Simple Data Pipe app tile, click the Open URL button. 2. In Simple Data Pipe, go to menu on the left and click Create a New Pipe . 3. Click the Type dropdown list, and choose Reddit AMA .When you added a reddit connector earlier, you added the Reddit option you’re choosing now. 4. In Name , enter ibmama . 5. If you want, enter a Description . 6. Click Save and continue . 7. Enter the URL for the AMA. We’ll use the sample IBM-hosted AMA we mentioned earlier: https://www.reddit.com/r/IAmA/comments/3ilzey/were_a_bunch_of_developers_from_ibm_ask_us 8. Click Connect to AMA . You see a You’re connected confirmation message. 9. Click Save and continue . 10. On the Filter Data screen, make the following 2 choices: * under Comments to Load , select Top comments only . * under Output format , choose JSON flattened . Then click Save and continue . Why flattened JSON? Flat JSON format is much easier for Apache Spark to process, so for this tutorial, the flattened option is the best choice. If you decide to use the Simple Data Pipe to process reddit data with something other than Spark, you probably want to choose JSON to get the output in its purest form. 11. Click Skip , to bypass scheduling. 12. Click Run now . When the data’s done loading, you see a Pipe Run complete! message. 13. Click View details . Tip: You can review the processed reddit comments in Cloudant along with theenriched Tone Analyzer metadata by clicking the run’s Details link and then clicking the Top comments only link. If prompted, enter your Cloudant password.ANALYZE AMA DATACREATE NEW PYTHON NOTEBOOK 1. In Bluemix, open your Apache Spark service. Go to your dasbhoard and, under Services , click the Apache Spark tile and click Open . 2. Open an existing instance or create a new one. 3. Click New Notebook . 4. Click the From URL tab. 5. Enter any name, and under Notebook URL enter https://github.com/ibm-cds-labs/reddit-sentiment-analysis/raw/master/notebook/Reddit-AMA-python.ipynb 6. Click Create Notebook 7. Copy and enter your Cloudant credentials.In a new browser tab or window, open your bluemix dashboard and click your Cloudant service to open it. From the menu on the left, click Service Credentials . If prompted, click Add Credentials . Copy your Cloudant host , username , and password into the corresponding places in cell 3 of the notebook (replacing XXXX’s). 8. Still in cell 3, at the end of the line, specify which cloudant database to load by making sure the following string includes name of the pipe you just created, ibmama .reddit_ibmama_top_comments_only Edit this string to include the name you gave your pipe in the preceding section. The naming convention here is reddit_PIPENAME_top_comments_only 9. Leave this notebook open. We’ll run this code in a minute.ABOUT THE SPARK-CLOUDANT CONNECTORBefore we run commands in the notebook, let’s peek under the hood. 
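As a rough sketch of what the notebook's connection cell boils down to (the host and credentials below are placeholders, the database name follows the reddit_PIPENAME_top_comments_only convention from the previous section, and it assumes the Spark-Cloudant connector package is available on the cluster), loading the pipe's output into a Spark DataFrame looks roughly like this:

# Illustrative sketch, not the notebook verbatim: read the pipe's Cloudant
# database into a Spark DataFrame with the Spark-Cloudant connector.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()   # in a Bluemix notebook, sc is usually predefined
sqlContext = SQLContext(sc)

df = (sqlContext.read.format('com.cloudant.spark')
      .option('cloudant.host', 'ACCOUNT.cloudant.com')   # placeholder
      .option('cloudant.username', 'USERNAME')           # placeholder
      .option('cloudant.password', 'PASSWORD')           # placeholder
      .load('reddit_ibmama_top_comments_only'))

df.printSchema()   # the same check you run in cell 4 of the notebook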
We use the Spark-Cloudant Connector , which lets you connect your Apache Spark instance to a Cloudant NoSQL DBinstance and analyze the data. This is a great way to leverage Spark’slightning-fast processing power directly on your Cloudant JSON data.RUN THE CODE AND GENERATE REPORTSNew to notebooks? If you’ve never used a Python notebook before, here’s how you run commands. Youmust run cells in order from top to bottom. To run a cell, click it (a boxappears around it) and in the menu above the notebook, click the Run button. While the command processes, an * asterisk appears (for a moment ora few minutes) in place of the number. When the asterisk disappears, and thenumber returns, processing is done, and you may move on to the next cell.Now you can run the code in each notebook cell. Here’s what you’re doing as yourun each command: 1. Run cells 1 and 2 to connect to a SparkContext.A SparkContext is the connection to a Spark cluster. It’s how you create RDDs and other items on that cluster. 2. Connect to your Cloudant database. Run cell 3 (which you just customized, adding your database credentials) to connect to Cloudant, where the AMA data resides. 3. Create the dataframe and get it in tabular format. In cell 4, run df.printSchema() then in cell 5, run df.show() . 4. Prep the dataframes for SQL commands. In cell 6, run df.registerTempTable(""reddit""); 5. Now start analyzing this data. Watson Tone Analyzer captures tones in the text, gauging: * emotions like Joy, Disgust, Anger, Fear, and Sadness * social traits like Agreeableness, Openness, Conscientiousness, Extraversion, and Emotional Range * language styles like Analytical, Tentative, and Confident First, run the following code to compute the distribution of tweets by sentiment scores greater than 70%. sentimentDistribution=[0] * 13 for i, sentiment in enumerate(df.columns[-23:13]): sentimentDistribution[i]=sqlContext.sql(""SELECT count(*) as sentCount FROM reddit where cast("" + sentiment + "" as String) > 70.0"")\ .collect()[0].sentCount 6. With the data stored in sentimentDistribution array, run the following code that plots the data as a bar chart. %matplotlib inline import matplotlib import numpy as np import matplotlib.pyplot as plt ind=np.arange(13) width = 0.35 bar = plt.bar(ind, sentimentDistribution, width, color='g', label = ""distributions"") params = plt.gcf() plSize = params.get_size_inches() params.set_size_inches( (plSize[0]*3.5, plSize[1]*2) ) plt.ylabel('Reddit comment count') plt.xlabel('Emotion Tone') plt.title('Histogram of comments by sentiments > 70% in IBM Reddit AMA') plt.xticks(ind+width, df.columns[-23:13]) plt.legend() plt.show() This bar chart shows the number of comments that scored above 70% for each tone. 7. In the last cell, run the following code to group by tone values:comments=[] for i, sentiment in enumerate(df.columns[-23:13]): commentset = df.filter(""cast("" + sentiment + "" as String) > 70.0"") comments.append(commentset.map(lambda p: p.author + ""\n\n"" + p.text).collect()) print ""\n--------------------------------------------------------------------------------------------"" print sentiment print ""--------------------------------------------------------------------------------------------\n"" for comment in comments[i]: print ""[-] "" + comment +""\n"" REVIEW RESULTSScroll through the resulting list. You’ll see comments grouped by tone. 
Remember that these are comments that scored greater than 70% for each value. Comments that scored high for Confident and Conscientiousness are listed and grouped under those tones. Some comments appear under multiple headings, because they scored high for more than one. For example, the following comment appears under the language style Analytical and also under the social trait Emotional Range (sensitivity to environment, moodiness):

How do you keep convincing people to pay for Lotus notes as an email solution?

Watson Tone Analyzer documentation says: “Tone analysis is less about analyzing how someone else feels, and more about analyzing how you are coming across to others.” So, how did IBMers come across within this AMA? Comments from IBMers take up most of the Agreeableness (tendency to be compassionate and cooperative toward others) section. They live there beside some “agreeable” questions from outsiders that come with a wink, like:

Is your favorite TV show Halt and Catch Fire? I really want it to be...

That comment also scored high under Extraversion and Emotional Range, maybe for its enthusiasm. No comments from IBMers appear under Emotional Range. These guys are a bunch of cool cats, perhaps, or just polite and friendly AMA hosts.

Note: No comments scored over 70% on emotions like Joy, Anger, Fear, Disgust, and Sadness. This conversation just didn’t get that heated. Try running another reddit AMA discussion through these same steps to see how results differ.

So, when reddit includes this IBM AMA in their next book, which category will they apply? Comments from non-IBMers may land this AMA in the Provocative or Humorous group. IBMers alone? Courageous, of course. ;-) Or perhaps Informative, which would put us in good company. Meanwhile, we’ll keep working hard and aspire to Ingenious.

OTHER OPTIONS

Now you know how to tweak the Simple Data Pipe to load data from a source you want, like reddit. Once you do so, the Cloudant-Spark Connector makes it easy to perform analysis on your Cloudant JSON. In this example, we used an iPython notebook to help us leverage Watson Tone Analyzer, but you can use the analysis tool of your choice. When you ran Simple Data Pipe, the reddit AMA landed in Cloudant. From there, it’s a breeze to send data on into dashDB. The dashDB data warehouse is also a great place to run analytics. Stay tuned for my next post, which will show you how to take reddit data, load it into dashDB, and analyze it with R. (Can’t wait? Watch a video on how these two work together.)

TRY THESE AMAS

Launch your Simple Data Pipe app again and return to the Load reddit AMA Data section. In step 7, swap in one of these AMA URLs and check out the results. * Matei Zaharia, creator of Spark https://www.reddit.com/r/IAmA/comments/31bkue/im_matei_zaharia_creator_of_spark_and_cto_at/ * Chris Rock https://www.reddit.com/r/IAmA/comments/2pi16o/chris_rock_here_ama/ * Tim Berners-Lee https://www.reddit.com/r/IAmA/comments/2091d4/i_am_tim_bernerslee_i_invented_the_www_25_years/ * Neil deGrasse Tyson https://www.reddit.com/r/IAmA/comments/qccer/i_am_neil_degrasse_tyson_ask_me_anything/ * Bill Gates https://www.reddit.com/r/IAmA/comments/18bhme/im_bill_gates_cochair_of_the_bill_melinda_gates/ * Louis C. K.
https://www.reddit.com/r/IAmA/comments/n9tef/hi_im_louis_ck_and_this_is_a_thing/ * Amy Poehler https://www.reddit.com/r/IAmA/comments/2kp7w0/im_amy_poehler_amaa/ * IBM’s Chef Watson https://www.reddit.com/r/IAmA/comments/3id842/we_are_the_ibm_chef_watson_team_along_with_our/ * Barack Obama https://www.reddit.com/comments/z1c9z/i_am_barack_obama_president_of_the_united_states/SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM PrivacyIBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.","Use Apache Spark, Cloudant, and Watson Tone Analyzer to perform sentiment analysis on a reddit Ask Me Anything web event.",Sentiment Analysis of Reddit AMAs,Live,267 787,"* R Views * About this Blog * Contributors * Some Resources * * R Views * About this Blog * Contributors * Some Resources * REPRODUCIBLE FINANCE WITH R: SECTOR CORRELATIONS SHINY APP by Jonathan Regenstein In a previous post , we built an R Notebook that pulled in data on sector ETFs and allowed us to calculate the rolling correlation between a sector ETF and the S&P 500 ETF, whose ticker is SPY. Today, we’ll wrap that into a Shiny app that allows the user to choose a sector, a returns time period such as ‘daily’ or ‘weekly’, and a rolling window. For example, if a user wants to explore the 60-day rolling correlation between the S&P 500 and an energy ETF, our app will show that. As is customary, we will use the flexdashboard format and reuse as much as possible from our Notebook. The final app is here , with the code available in the upper right-hand corner. Let’s step through this script. The first code chunk is where we do the heavy lifting in this app. We will build a function that takes as parameters an ETF ticker, a returns period, and a window of time, and then calculates the desired rolling correlation between that ETF ticker and SPY. That function uses getSymbols() to pull in prices and periodReturns() to convert to log returns, either daily, weekly or monthly. Then we merge into one xts object and calculate rolling correlations, depending on the window parameter. It should look familiar from the Notebook, but honestly, the transition from the previous Notebook to this code chunk wasn’t as smooth as would be ideal. I broke this into two functions in the Notebook, but thought it flowed more smoothly as one function in the app since I don’t need the intermediate results stored in a persistent way. Combining the two functions wasn’t difficult, but it did break the reproducible chain in a way that I don’t love. 
In the real world, I would (and, in my IDE, I did) refactor the Notebook to line up with the app better. Enough self-shaming, back to it. Next, we need to create a sidebar where our users can select a sector, a returns period and a rolling window. Nothing fancy here, but one thing to note is how we use selectInput to translate from the sector to the ETF ticker symbol. This means our users don’t have to remember those three-letter codes; they just choose the name of the desired sector from a drop-down menu. Have a close look at the last three lines of code in that chunk. These are a new addition that let the user determine if the mean, max and/or min rolling correlation should be included in the dygraph. We haven’t built any way of calculating those values yet, but we will shortly. This is the UI component. Those three lines of code create checkboxes and are set to default as FALSE, meaning they won’t be plotted unless the user chooses to do so. I wanted to force the user to actively click a control to include these, but that’s a purely stylistic choice. Perhaps you don’t want to give them a choice at all here? Next, we create our reactive values that will form the substance of this app. First, we need to calculate and store an object of rolling correlations, and we’ll use a reactive that passes user inputs to our sector_correlations function. Then, we build reactive objects to store mean, minimum and maximum rolling correlations. These values will help contextualize our final dygraph. At this point, we have done some good work: built a function to calculate rolling correlations based on user input, built a sidebar to take that user input, and coded reactives to hold the values and some helpful statistics. The hard work is done, and really we did most of the hard work in the Notebook, where we toiled over the logic of arriving at this point. All that’s left now is to display this work in a compelling way. Dygraphs plus value boxes has worked in past; let’s stick with it! That dygraph code should look familiar from the Notebook and previous posts, except we have added a little interactive feature. By including if(input$mean == TRUE) {avg()} , we allow the user to change the graph by checking or unchecking the ‘mean’ input box in the sidebar. We are going to display this same information numerically in a value box, but the lines make this graph a bit more compelling. Speaking of those value boxes, they rely on the reactives we built above, but, unlike the graph lines, they are always going to be displayed. The user doesn’t have a choice here. Again, this just adds a bit of context to the graph. Note that the lines and the value boxes take their value from the same reactives. If we were to change those reactives, both UI components would be affected. Our job is done! This a simple but powerful app: the user can choose to see the 60-day rolling correlations between the S&P 500 and an energy ETF, or the 10-month rolling correlations between the S&P 500 and a utility ETF, etc. I played around with this a little bit and was surprised that the 10-week rolling correlation between the S&P 500 and health care stocks plunged in April of 2016. Someone smarter than I can probably explain, or at least hypothesize, as to why that happened. A closing thought about how this app might have been different: we are severely limiting what the user can do here, and intentionally so. The user can choose only from the sector ETFs that we are offering in the selectInput dropdown. 
This is a sector correlations app, so I included only a few sector ETFs. But, we could just as easily have made this a textInput and allowed the users to enter whatever ticker symbol struck their fancy. In that case, this would not longer be a sector correlations app; it would be a general stock correlations app. We could go even further and make this a general asset correlations app, in which case we would allow the user to select things like commodity, currency and housing returns and see how they correlate with stock market returns. Think about how that might change our data import logic and time series alignment. Thanks for reading, enjoy the app, happy coding, and see you next time! Jonathan Regenstein 2017-02-02T19:43:19+00:00LEAVE A COMMENT CANCEL REPLY Comment 250 Northern Ave, Boston, MA 02210 844-448-1212 info@rstudio.com DMCA Trademark Support ECCN * Missed #rstudioconf ? Here are some tips from IDE engineer @kevin_ushey ! Slides from all talks forthcoming. #rstats twitter.com/bhaskar_vk/sta… 2 weeks ago Copyright 2016 RStudio | All Rights Reserved | Legal Terms Twitter Linkedin Facebook Rss Email github Rss","In a previous post, we built an R Notebook that pulled in data on sector ETFs and allowed us to calculate the rolling correlation between a sector ETF and the S&P 500 ETF, whose ticker is SPY. Today, we’ll wrap that into a Shiny app that allows the user to choose a",Sector Correlations Shiny App,Live,268 788,"☰ * Login * Sign Up * Learning Paths * Courses * Badges * Our Badges * BDU Badge Program * Events * Blog * Resources * Resources List * Downloads * BLOG Welcome to the Big Data University Blog .SUBCRIBE VIA FEED RSS - Posts RSS - Comments SUBSCRIBE VIA EMAIL Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address RECENT POSTS * This Week in Data Science (September 06, 2016) * This Week in Data Science (August 30, 2016) * This Week in Data Science (August 23, 2016) * This Week in Data Science (August 16, 2016) * This Week in Data Science (August 09, 2016) CONNECT ON FACEBOOK Connect on FacebookFOLLOW US ON TWITTER My TweetsTHIS WEEK IN DATA SCIENCE (SEPTEMBER 06, 2016) Posted on September 6, 2016 by cora Here’s this week’s news in Data Science and Big Data. Don’t forget to subscribe if you find this useful! INTERESTING DATA SCIENCE ARTICLES AND NEWS * How Tech Giants Are Devising Real Ethics for Artificial Intelligence – Researchers from tech companies have been meeting to discuss the impact of artificial intelligence on jobs, transportation and even warfare. * The Three Faces of Bayes – The three main uses of the term “Bayesian” are presented through the lens of a naïve Bayes classifier. * IBM Data Science Experience: First steps with yorkr – A user goes through his use of IBM’s Data Science Experience, an integrated delivery platform for analytics. * EPA challenges communities to develop air sensor data platforms – EPA’s chief data scientist believes that real-time pollution sensors are the way of the future. * Inside the ‘brain’ of IBM Watson: how ‘cognitive computing’ is poised to change your life – IBM’s Cognitive Computing revolution is changing how doctors, financial experts, and many other professions find and investigate key issues in their work. * The sneaky math that made the lottery more alluring – and harder to win – In recent years, popular lotteries have been re-engineered to make their contests more appealing, but also further decrease your odds of hitting the jackpot. 
* What Robots Can Learn from Babies – Researchers at the Allen Institute for Artificial Intelligence (Ai2) in Seattle have developed a computer program that shows how machines determine how the objects captured by a camera will most likely behave. * Majority of mathematicians hail from just 24 scientific ‘families’ – The evolution of mathematics is traced using a comprehensive genealogy database. * Enhanced DMV facial recognition technology helps NY nab 100 ID thieves – In January, the New York State DMV enhanced its facial recognition technology by increasing the measurement points of a driver’s license picture. * How to Become a Data Scientist – Part 1 – Check out this excellent (and exhaustive) article on becoming a data scientist, written by someone who spends their day recruiting data scientists. * Essentials of Machine Learning Algorithms (with Python and R Codes) – Sunil created a guide to simplify the journey of aspiring data scientists and machine learning enthusiasts across the world. * Could artificial intelligence help humanity? Two California universities think so – Two California universities separately announced new centers devoted to studying the ways in which AI can help humanity. * Y’all have a Texas accent? Siri (and the world) might be slowly killing it – Voice recognition tools such as Apple’s Siri still struggle to understand regional quirks and accents, and users are adapting the way they speak to compensate. * Big data salaries set to rise in 2017 – Starting salaries for big data pros will continue to rise in 2017 as companies jockey to hire skilled data professionals. * 10 Years of Color – Analysis on my Personal Photo Collection – See how Brett Kobold creates a data visualization that shows the most prominent color from every photo he took over the last 10 years. UPCOMING DATA SCIENCE EVENTS * IBM World of Watson 2016 – Unleash your company’s cognitive potential at IBM World of Watson 2016 this October. * Graph Processing with Spark GraphX – Learn about graph processing on September 8th. * IBM DataFirst Launch Event – Join data and analytics leaders and practitioners from the open source community, startups, and enterprises at the IBM DataFirst Launch Event on September 27th in NYC. SHARE THIS: * Facebook * Twitter * LinkedIn * Google * Pocket * Reddit * Email * Print * RELATED Tags: analytics , Big Data , data science , events , weekly roundup -------------------------------------------------------------------------------- COMMENTS LEAVE A REPLY CANCEL REPLY * About * Contact * Blog * Community * FAQ * Ambassador Program * Legal Follow us * * * * * * * Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.",Here’s this week’s news in Data Science and Big Data. ,"This Week in Data Science (September 06, 2016)",Live,269 791,"400 BAD REQUEST LIemNiB4/cqS644CH @ Tue, 04 Oct 2016 14:39:26 GMT SEC-43",Compilation of Youtube videos teaching Statistics using R and other languages,Learning Statistics on Youtube,Live,270 799,"CONFIGURING COMPOSE ENTERPRISE ON GOOGLE CLOUD PLATFORM Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published Sep 1, 2016*Compose is now available on Google's Cloud Platform . 
Being a different platform, the configuration for launching your first Compose Enterprise cluster is also different and in this article, we'll walk you through what you need to do to create your own database powerhouse in your private cloud. Before creating a Compose Enterprise cluster, you will need to do some preparatory work in the Google Cloud. We'll assume at this point that you've created a project under your Google Cloud account and enabled billing on it - without that Google will not let you enable the APIs used by Compose to create the hosts needed for your cluster. GOOGLE CLOUD SEEDING You should be here at your Dashboard on Google Cloud. There's a ""Service Account"" we need to create so select the menu in the top left of the dashboard: This will slide in the main Google Cloud Platform menu. If you are wondering where something is in Google Cloud Platform, head to this menu and you should be able to filter it down. What we want is at the top though. And select IAM & Admin . Then select Service accounts ... Followed by clicking on the Create service account label. This will get you here: Enter a name for your service account and make sure you check the Furnish a new key check box. This will ensure that the keys you need to access your project and resources will be transferred to you as a JSON file. Watch out for that, it's a blink or you'll miss it download. Make sure that file is safe and we can move on. The next thing to be done is to give your new service account the ability to manage storage. Go back to the left hand menu and select IAM . Then look up that account you just named in the list of accounts displayed. You need to grant some roles, ""Storage Admin"", ""Storage Object Admin"" and ""Service Account Manager"", to that user. For the first two, click on the Editor drop down and then scroll the list till you see Storage . Click on that and a pop-up menu opens up. Click on ""Storage Admin"" and ""Storage Object Admin"" to add their roles. Next select ""Project"" in the menu: Click on ""Service Account Actor"" then click on Save . CREATING THE CLUSTER ON COMPOSE Head over to your Compose console and select the Enterprise button from your left hand side bar. Now click the Create Cluster button on the right. Enter a name for your new Enterprise cluster at the top and select ""Google Cloud Platform"" from the options. The page will then expand with this form: Most of the values for this form is in the JSON file we downloaded when the service account was created. Open that up in your preferred editor. The project_id field value should be copied, less the double-quotes, into the project_id field. The same for the private_key_id to private key id , private_key to private key , client_email to client_email and client_id to client id . Watch out for the private_key value, it's a long one. The last two fields are not in the JSON document. The region is the Google Cloud Region you want your hosts deployed to, such as us-east1 - find out more about Regions and Zones on the Google Cloud Platform help. The other field, bucketname is a name for the backups - enter a name which you prefer here. Finally, there is a slider which shows how much Compose will charge per month for provisioning a cluster that supports that much total RAM. Each cluster is made of 3 hosts so 24GB, for example, represents 8GB of RAM on each host. You will, of course, be provisioning and charged directly by Google for those hosts. In our walkthrough, we'll go with 24GB and then click Create Cluster . 
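Before moving on, if you want to sanity-check which values from the key file go into which form fields, a small throwaway script like this works (the file name is an assumption; use whatever name your browser saved the key under):

# Quick check of the downloaded service account key (file name assumed).
import json

with open('compose-service-account.json') as f:
    key = json.load(f)

for field in ('project_id', 'private_key_id', 'client_email', 'client_id'):
    print(field + ':', key[field])

# private_key is long; just confirm it is present rather than printing it all.
print('private_key present:', key['private_key'].startswith('-----BEGIN'))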
GETTING YOUR OWN DEPLOYMENT CONFIGURATION The cluster will be created on Compose's side at this point, but the hosts on the Google side have yet to be created and connected to Compose. That's what this next page is about: The first step is to create a Google Deployment Manager configuration which can do the creation work for you. This is a YAML file which Compose will create according to your needs. First, we select where we want to deploy these hosts. For example, if you entered us-east1 in the preceding Create Cluster form, you should select ""Eastern US"". You will need to select the size of host for your deployment. Google offers a number of predefined standard hosts and high-memory hosts . Select one that matches up with the memory that you selected when you created the cluster. For example, if we selected 24GB when creating the cluster, that equates to 8GB of RAM per host, and looking at the predefined machine types, the nearest match is the ""n1-standard-2"" with two virtual cores and 7.5GB of RAM. We'll select that on the page. The last item is the amount of storage to initially allocate to your deployments. The slider ranges from 512GB to 3TB. Select how much you'd like and we're ready to make the configuration file. Click on Download Configuration and a file will be downloaded to your system called compose-enterprise.yaml . ENABLE APIS Before you move on to the next step, you need to activate an API, the Google Cloud Deployment Manager V2 API. You can navigate to this through the API option in the Cloud dashboard, or simply visit the API page and click on Enable . STARTING THE CLUSTER DEPLOYMENT ON GOOGLE This file needs to be used with machine with Google Cloud SDK installed on it. This can be done on a workstation or you can make use of the Google Cloud Shell, a remote shell running on the Google cloud. In the case of having the SDK installed locally, follow the appropriate install instructions for your operating system . That process will include setting up your account and connection to Google Cloud so you will need your account details to hand. If you only have one project on Google Cloud, the SDK will automatically select that. When the process is complete, you should be able to run gcloud auth list to see which accounts are configured and active. In the case of using the Google Shell we use the built in shell in the Google Cloud Platform dashboard. Select the ""prompt"" icon in the top menu and it will connect, through the web browser, to a shell on a system preconfigured. What isn't in the shell is the configuration file we need. There's a number of routes to getting it over but the quickest way is just to cut and paste it into an editor. On Mac OS X, you can do cat compose-enterprise.yaml | pbcopy or on Linux, you can install xclip and then run cat compose-enterprise.yaml | xclip -selection clipboard . Then you can go to the Google Cloud Shell and run the nano editor with nano compose-enterprise.yaml , paste the clipboard into the editor and then exit with control-X then y then return. With the file in place we can continue. The command displayed in step two on the Compose Hosts page now needs to be executed. It assumes that you will be in the same directory as the file you downloaded (or copied over). If it isn't, change the file name that comes after the --config to point at your downloaded file. If you get an error like: ERROR: (gcloud.deployment-manager.deployments.create) ResponseError: code=403, message=Access Not Configured. 
Google Cloud Deployment Manager API has not been used in project 99999999999 before or it is disabled. Then go back to the Enable APIs step above, do that and retry the command. For illustration, what you should see is something like this, only with your own names in it: [~] gcloud deployment-manager deployments create exemplumcluster --config Downloads/compose-enterprise.yaml Waiting for create operation-1470301061838-5393b2480f4b1-e795a8d6-47a1fe4b...done. Create operation operation-1470301061838-5393b2480f4b1-e795a8d6-47a1fe4b completed successfully. NAME TYPE STATE ERRORS exemplumcluster-disk-0-data compute.v1.disk COMPLETED [] exemplumcluster-disk-0-swap compute.v1.disk COMPLETED [] exemplumcluster-disk-1-data compute.v1.disk COMPLETED [] exemplumcluster-disk-1-swap compute.v1.disk COMPLETED [] exemplumcluster-disk-2-data compute.v1.disk COMPLETED [] exemplumcluster-disk-2-swap compute.v1.disk COMPLETED [] exemplumcluster-image compute.v1.image COMPLETED [] exemplumcluster-instance-0 compute.v1.instance COMPLETED [] exemplumcluster-instance-1 compute.v1.instance COMPLETED [] exemplumcluster-instance-2 compute.v1.instance COMPLETED [] exemplumcluster-network compute.v1.network COMPLETED [] exemplumcluster-network-capsules compute.v1.firewall COMPLETED [] exemplumcluster-network-udp-4789 compute.v1.firewall COMPLETED [] [~] The cluster is now being deployed and after a few minutes, reloading the Hosts page will show you that the initialisation is taking place: It should take around 20 minutes for this process to complete as each element of the Google cluster meshes with Compose's cluster management. After 20 minutes, refreshing the page should show the cluster as ready to run: THE FIRST DATABASE DEPLOYMENT Your first database deployment can now be done. Click on Create Deployment and you'll see the Compose database selection page. Select any database and you'll see the form for deploying your database, with one difference: There's one difference from the default deployment page. Because we have a Enterprise cluster, the Create Deployment On option appears and will default to the Enterprise cluster. It's still possible to select Compose Hosted databases, but they are charged separately from the Enterprise cluster. The interface defaults to the Enterprise cluster to avoid that. Enter the name for your deployment, select your options and configure your initial deployment resources. On Enterprise, the current default minimum is a configuration with 1GB of RAM. Once done, click Create Deployment and Compose will provision your database. VPN STEPS It's at this point in the configuration process that you have a choice. At Compose, we understand that Compose Enterprise customers will have different security requirements and, rather than open up ports to your cloud infrastructure automatically, we give you the opportunity to apply your own security procedures and processes. Briefly, that Compose hosts will need to be accessible from wherever you are administering them. You can configure a VPN or SSH tunnelling to achieve this. Within the network, enable your access host to pass TLS traffic to and from the hosts and this should cover most databases requirements. Applications configured within your project will require that the firewall rules allow them to connect to the Compose database hosts. The internal IP addresses of the hosts are mapped to *.compose.direct DNS addresses. That said, we also know that users may just want to quickly configure a VPN to access their databases. 
In that case we offer the following guide to creating an IPSEC VPN with the least steps possible. CREATING THE VPN INSTANCE The first step in this process is to create a machine instance that will run your VPN software. Go to the Google Cloud Platform console and select Compute Engine from the products menu. Select VM Instances from the sidebar and then select Create Instance from the top list of options. Give the new instance a name, it's mostly decorative – we'll call ours vpngateway – then select a zone for this instance to live in; you can accept the default offered if you wish or you can set it to a zone in the region where you placed your Compose Enterprise cluster. Generally for administration you won't need a whole dedicated CPU to handle the VPN load, so in Machine Type select Micro to reduce the cost of this new node. For the boot disk, click Change and select Ubuntu 14.04 LTS. Then carry on down the page till you hit the Management, disk, networking, SSH keys link. Click that to reveal the options underneath. The first screen that is revealed will be Management . Click in the Tags field and enter vpn . We'll need that tag when we set up the firewall rules. Now select Networking . This is where we set up this instance to be our gateway between the outside world and our Compose cluster. The Network field should be set to the network that was created when we created the cluster – in our example, we named the cluster exemplumcluster so the network is exemplumcluster-network so we select that. Set the External IP to ""New static IP address"". A dialog will pop up asking you to reserve an IP address with a name – we'll use vpnip for a name – enter a name and click Reserve . Finally set IP Forwarding to On and click Create . The display will now return to the VM Instances dashboard with an extra entry and after a little while, our new node will be deployed and it'll show an SSH button next to it. It's time to log into our gateway to configure it. INSTALLING THE VPN SOFTWARE Click that SSH button and Google will start a session to the VPN gateway. There's a lot of ways you could enable this as a gateway and we're going to use one of the quickest and simplest ones we've found hwdsl2's setup-ipsec-vpn . This is script which automatically configures the system to run a IPSEC VPN and it can be run with no user intervention whatsoever - see the [installation instructions]9 https://github.com/hwdsl2/setup-ipsec-vpn#ubuntu--debian ) for alternative ways of setting it up. For our configuration needs, all we need to do is run this: wget https://git.io/vpnsetup -O vpnsetup.sh && sudo sh vpnsetup.sh Hit return and watch as the script downloads and builds the required code into a VPN. When it finishes, it'll display something like this: ================================================ IPsec VPN server is now ready for use! Connect to your new VPN with these details: Server IP: 104.196.169.215 IPsec PSK: BqSfZg8qcNFDjLAc Username: vpnuser Password: M7J6Bt3EyCmwPZbM Write these down. You'll need them to connect! Important notes: https://git.io/vpnnotes Setup VPN clients: https://git.io/vpnclients ================================================ That bit about writing them down, do it then exit from the SSH session. These are our IPSEC VPN credentials. The VPN is running, but there's still a step to go. OPENING THE FIREWALL We need to allow the traffic to flow from the outside to the VPN and to allow TLS traffic to go between the VPN and the hosts. This can be done from the GCP console. 
Go to the Networking product page and you'll see the general networking overview. There will be a default network, at least, and the network for the cluster – in our example exemplumcluster-network . Select the clusters network and you'll now see this: We can add firewall rules here simply by clicking on Add firewall rule which brings up this form: First, the incoming rule for the VPN. Give it the name vpn-rule and select ""Allow from any source (0.0.0.0/0)"" in the source filter. Then, in the Allowed protocols and ports field put tcp:1701; udp:4500; udp:500 This allows TCP/IP traffic on port 1701 and UDP traffic on ports 4500 and 500. In the Target tags field, enter vpn , the tag allocated to the instance when it was created earlier. This will lock down the rule to being between the outside word and the VPN host. Click Create and the rule will be applied. That lets the traffic from the VPN in. Now we need to enable TLS connections within the cluster. Click Add firewall rule again. Name this rule ""databrowser"". The Source Filter will need to be set to Subnetworks and when you do that, the form changes to allow you to enter those subnetworks: We need all the subnetworks in this case, so click Select all and Ok . In the Allowed protocols and ports enter: tcp: 443 We won't be setting any target tags as this rule will apply to all systems in the cluster. Click Create and that should make your clusters VPN connection ready to use. CONFIGURING THE INCOMING CONNECTION How you set up your incoming connection will entirely depend upon your operating system. Recall back when the connection credentials were generated, there were a few URLs included. Specifically, https://git.io/vpnclients , which gives directions for creating a client VPN connection on Windows, Linux, Mac OS X, iOS and Android. We'll use Mac OS X as an example here. As per the instructions at the previous link, go to System Preferences and then to the Network section, click on the + at the bottom of the interface list to add an interface and select VPN in the drp down that appear. Diverging slightly from the instructions, select Cisco IPsec as the VPN type. Click Create to make the network interface and you'll return to the Network screen with the new interface selected. Now we can fill the details for our VPN server connection. From the information we recorded earlier... * enter the Server IP into the Server Address field * enter the Username into the Account Name field * enter the Password into the *Password field * to use the IPsec PSK * click Authentication Settings * select Shared Secret * enter the IPsec PSK into the Shared Secret field * click Ok * click Apply * click Connect TESTING THE CONNECTION The VPN should have been configured and connected by now. If you want to see if it is configured you can either try selecting the data browser in any database that have a browser option. The data browser is integrated into your cluster and seamlessly blends with the Compose console; if it appears, the VPN is working. For any database with a HTTPS web ui (eg RethinkDB or RabbitMQ), you can also try connecting to their admin UI (details in the Compose console for deployed databases). BEYOND DEPLOYMENT You now have a Compose cluster running of the Google Cloud Platform, complete with VPN access. You can deploy new compute instances into the Google Cloud project to run your application and connect directly to those databases, or create secure tunnels or SSL connections to remote applications. 
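For example, if the deployment you created happened to be PostgreSQL, an application instance inside the project could connect over TLS roughly like this (the host, port, and credentials are placeholders; take the real values from the deployment's connection details in the Compose console):

# Sketch only: connect to a Compose PostgreSQL deployment from inside the project.
import psycopg2

conn = psycopg2.connect(
    host='portal-0.example-deployment.compose.direct',  # placeholder host
    port=10000,                                          # placeholder port
    dbname='compose',                                    # placeholder database
    user='admin',                                        # placeholder user
    password='PASSWORD',                                 # placeholder password
    sslmode='require'                                    # TLS connection
)

with conn.cursor() as cur:
    cur.execute('SELECT version()')
    print(cur.fetchone()[0])

conn.close()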
The choice is yours with Compose Enterprise.","Compose is now available on Google's Cloud Platform. Being a different platform, the configuration for launching your first Compose Enterprise cluster is also different and in this article, we'll walk you through what you need to do to create your own database powerhouse in your private cloud.",Configuring Compose Enterprise on Google Cloud Platform,Live,271 800,"SIMPLE DATA VISUALIZATION IN APACHE COUCHDB™ D3, RIGHT IN THE FAUXTON DASHBOARD, VIA USERSCRIPT Mike Broberg / May 9

Have you ever wanted to quickly visualize the results of CouchDB’s built-in reduce functions for some quick feedback, without leaving the context of its handy dashboard?

The JSON representation of a CouchDB map-reduce operation, aggregating movies by rating.

INTRODUCING CHANGO

Recently, my colleague and office neighbor va barbosa published an article on integrating a data-visualizing view directly into the Cloudant dashboard. Essentially, it’s a userscript that adds a new menu button to a database view when the results are aggregated and JSON is returned in a specific format. Clicking the Chart button will render a D3 chart. Now, it works in CouchDB, too:

Automatically visualize aggregated CouchDB JSON with Chango.

Because Cloudant and CouchDB now share the same codebase, updating Va’s userscript (we call it Chango, as a portmanteau combining “chart + Mango”) was pretty straightforward. While the spirit of the Mango query interface is to make querying CouchDB easier, we decided to riff on the name with “Chango,” since it aspires to make data visualization in CouchDB more convenient. Here is the Chango script, in its entirety (embedded gist in the source post, captioned “All the userscript for Chango”).

GENERATING YOUR FIRST CHANGO CHART

Chango currently works using the Firefox browser with the Greasemonkey extension. Once you have set up the browser, click the view raw button and install the script when prompted. Rather than write your own reduce functions, CouchDB comes with built-in reduce functions that run in Couch’s native Erlang. Make sure to specify your reduce when defining your database view, like so:

Using the built-in reduce function _sum to aggregate results on the Movie_rating field. _count would also work here, and without emitting the value 1 for each document in the index.

Then, include the reduce in your query options when using the dashboard:

Including the Reduce query option in the Fauxton dashboard.

You’ll be all set to generate your chart from there. Some Chango charts expect data in the same format. For example, pie-, bar-, and bubble-chart all expect to render data in the schema of [{ key: """", value: n }, ...].
When that happens, Chango will randomly select one of them. Just toggle the Chart button until you get the pie, bar, or bubble visualization you prefer. Through Chango’s dependency on Va’s simple-data-vis project, you can find the JSON schemas that SimpleDataVis expects . There’s more there than your basic charts covered here, so check it out. CHANGO UNCHAINED With that, we’re excited to see what the CouchDB community does with Chango and SimpleDataVis. Please let us know in the comments about any modifications you’ve made or questions you have. Thanks for checking out Chango, and please ♡ this article to recommend it to other Medium readers. Thanks to va barbosa . * Data Visualization * Couchdb * JavaScript * Web Development * Cloudant 1 Blocked Unblock Follow FollowingMIKE BROBERG Editor for the IBM Watson Data Platform developer advocacy team. OK person. FollowIBM WATSON DATA LAB The things you can make with data, on the IBM Watson Data Platform. * Share * 1 * * * Never miss a story from IBM Watson Data Lab , when you sign up for Medium. Learn more Never miss a story from IBM Watson Data Lab Get updates Get updates","Have you ever wanted to quickly visualize the results of CouchDB’s built-in reduce functions for some quick feedback, without leaving the context of its handy dashboard? Recently, my colleague and…",Simple data visualization in Apache CouchDB™ – IBM Watson Data Lab – Medium,Live,272 804,"* Videos & Webinars * About Me // Contact * Download The E-book! * Blog * Why Data? Menu Close * Videos & Webinars * About Me // Contact * Download The E-book! * Blog * Why Data? Hey, I'm Tomi Mester. This is my data blog, where I give you a sneak peek into online data analysts' best practices. You will find here articles and videos about data analysis, AB-testing, researches, data science and more...SUBSCRIBE FOR DATA ARTICLES HERE: Email Address * Name *© 2017 Data36 . Powered by WordPress . STATISTICAL BIAS TYPES EXPLAINED (WITH EXAMPLES) – PART1 Written by Tomi Mester on August 21, 2017Humans are stupid. We all are, because our brain has been made that way. The most obvious evidence to this built-in stupidity is the different biases, that our brain produces. Even if it’s so, at least we can be a bit smarter, than the average, if we are aware of them. This is a data blog, so in this article I’ll focus only on the most important statistical bias types – but I promise, that even if you are not an aspiring data professional (yet), you will profit a lot from this write up. For the ease of understanding for each statistical bias type I’ll provide two examples: an everyday one and a more online analytics related one! And just to make this clear: biased statistics are bad statistics. Everything I will describe here is to help you prevent the same mistakes, that some of the less smart “researcher” folks are doing time to time. THE MOST IMPORTANT STATISTICAL BIAS TYPES There is a long list of statistical bias types. I’ll cover those, that can affect your job as a data scientist or analyst the most. These are: 1. Selection bias 2. Self-selection bias 3. Recall bias 4. Observer bias 5. Survivorship bias 6. Omitted variable bias 7. Cause-effect bias 8. Funding bias 9. Cognitive bias STATISTICAL BIAS #1: SELECTION BIAS proper random sampling selection bias Selection bias occurs, when you are selecting your sample or your data wrong. Usually this means accidentally working with a specific subset of your audience instead of the whole, hence your sample is not representative of the whole population. 
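To see the effect in numbers, here is a tiny, purely made-up simulation (every number in it is invented for illustration) of what happens when you estimate an opinion from an easy-to-reach subgroup instead of a proper random sample:

# Toy simulation of selection bias; all numbers are made up.
import random

random.seed(42)

# Population of 100,000 people in two equally sized groups.
# Group A approves of something with 80% probability, group B with 20%.
population = ([('A', random.random() < 0.8) for _ in range(50_000)] +
              [('B', random.random() < 0.2) for _ in range(50_000)])

def approval_rate(sample):
    return sum(approves for _, approves in sample) / len(sample)

random_sample = random.sample(population, 1_000)                            # proper random sample
easy_sample = random.sample([p for p in population if p[0] == 'A'], 1_000)  # only the 'easy to reach' group

print('True rate:           {:.1%}'.format(approval_rate(population)))      # ~50%
print('Random sample:       {:.1%}'.format(approval_rate(random_sample)))   # ~50%
print('Group-A-only sample: {:.1%}'.format(approval_rate(easy_sample)))     # ~80%, badly biased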
There are many underlying reasons, but by far the most typical I see: collect and work only with data that is easy to access . Everyday example of selection bias:Please answer this question: What’s people’s overall opinion about Donald Trump’s presidency? Most people have an immediate and very “educated” answer for that. Unfortunately for many of them the top source of their information is their Facebook feed. Very bad and sad practice, because what they see there does not show the public opinion – it’s only their friends’ opinion. (In fact, it’s even narrower, because they see there only those friends’ opinion, who are active and posting to Facebook – so most probably 25-35 and extroverted people are overrepresented.) That’s a classic selection bias: easy-to-access data, but only for a very specific, unrepresentative subset of the whole population. Note 1: I do recommend blocking your Facebook feed for many reasons, but mostly not to get narrow-minded by it: FB News Feed Eradicator ! Note 2: If you want to read another classy selection bias story, check how Literary Digest did a similar mistake (also referred as undercoverage bias) ~80 years ago! Online analytics related example of selection bias:Another example for selection bias is, when you send out a survey for your newsletter subscribers – asking what new product would they pay for. Of course, interacting with your audience is important (I send out surveys to my Newsletter Subscribers sometimes too), but when you analyze these survey results, you should be aware, that your newsletter subscribers are not representing your potential paying audience. There might be a bunch of people, who are willing to pay for you, but they are not a part of your newsletter list. And on the other hand there might be a lot of people on your list, who would never spend money on your products, they are around just to get notified about your free stuff. And that’s only one reason yet (see the rest below), why surveying is just the simple worst research method. By the way, for this particular example, I’d suggest to do fake door testing instead! STATISTICAL BIAS #2: SELF-SELECTION BIAS Self-selection bias is a subcategory of selection bias. If you let the subjects of your analyses/researches select themselves, that means that less proactive people will be excluded. The bigger issue is that self-selection is a specific behaviour – that implies other specific behaviours – thus this sample does not represent the entire population. Everyday example of self-selection bias:Any type of polling/surveying. Eg. when you want to research successful entrepreneurs’ behaviour with surveys, your results will be skewed for sure. Why? Because successful people most probably don’t have time/motivation to answer or even take a look at random surveys. So the 99% of your answers will come from entrepreneurs, who thinks they are successful, but in fact they are not. In this specific case, I’d rather try to lure people who are proven to be successful into face-to-face interviews. Online analytics related example of self-selection bias:Say, you have an online product – and a knowledge base for that with 100+ how-to-use-the-product kind of articles in it. Let’s find out how good your knowledge base is and compare the users, who read at least 1 article from it to the users who didn’t. We find that the article-reader users are 50% more active in terms of product usage, than the non-readers. Knowledge base performs great! Or does it? 
In fact, we don’t know, because the article-readers are a special subset of your whole population, who might have a higher commitment to your product and this might be the reason of their interest in your knowledge base. With other words, they have “selected themselves” into the reader-group. This self-selection bias leads to a classy correlation/causation dilemma , that you can never solve by data research, just by A/B testing . STATISTICAL BIAS #3: RECALL BIAS Recall bias is another common error of interview/survey situations, when the respondent doesn’t remember correctly for things. It’s not bad or good memory – humans have selective memory by default. After a few years certain things stay, others fade. It’s normal, but it makes researches much more difficult. Everyday example of recall bias:How was that vacation 3 years ago? Awesome, right? Looking back we tend to forget the bad things and keep remembering to the good things only. Although it doesn’t help us to objectively evaluate different memories, I’m pretty sure our brain is like that for a good reason. Online analytics related example of recall bias:I’m holding data workshops from time to time. I usually send out feedback forms afterwards, so I can make the workshops better and better based on participants’ feedbacks. I usually send them the day after the workshop, but there was one particular case when I completely forgot it and sent it one week later. Looking at the comments I got, that was my most successful workshop of all time. Except that it’s not necessarily true. It’s more likely that recall bias might have kicked in pretty hard. One week after the workshop neither of the attendees would recall if the coffee were cold or if I was over-explaining a slide here or there. They remembered only to the good things. Not that I wasn’t happy for their good feedback, but if the coffee were cold, I would want to know about it – to get it fixed for the next time… STATISTICAL BIAS #4: OBSERVER BIAS Observer bias is happening, when the researcher subconsciously projects his/her expectations to the research. It can come in many forms. Eg. (unintentionally) influencing the participants (only at interviews and surveys) or doing some serious cherry picking (focusing rather on the statistics that support our hypothesis, than to the statistics, that doesn’t.) Everyday example of observer bias:Fake news! 🙂 It needs a very thorough and consequent investigative journalist to be OK with rejecting her own null-hypothesis at the publication phase. Eg. if a journalist spends 1 month on an investigation to prove that the local crime rate is high because of the careless police officers – most probably she will find a way to prove it – leaving aside the counter arguments and any serious statistical considerations. Extended by other common journalist-kind-of statistical biases, like funding bias (studies tend to support the financial sponsors’ interests) or publication bias (to fake or extremize the research results to get published) led me to the conclusion that reading any type of online media will never get me closer to any sort of truth about our world. So I’d rather suggest to consume trustful statistics than online media – or even better: find trustworthy raw data and do your own analyses to learn a “truer truth”. Online analytics related example of observer bias:Observer bias can affect online researches as well. Eg. when you are doing a Usability Tests . 
As a user researcher, you know your product very well (and maybe you like it too), so subconsciously you might have expectations. If you are a pro User Experience Researcher, you will know, how not to influence your testers by your questions – but if you are new to that field, make sure you spend enough time with preparing good, unbiased questions and scenarios. Maybe consider hiring a professional UX consultant to help. Note: in my workshop feedback example observer bias can occur if I send out the survey right after the workshop. Participants might be under the influence of the personal encounter – and this might indicate that they don’t want to “hurt my feelings” with negative feedbacks. Workshop feedback forms should be sent 1 day after the workshop itself. STATISTICAL BIAS #5: SURVIVORSHIP BIAS Survivorship bias is a statistical bias type, where the researcher is focusing only to that part of the data set, that already went through some kind of pre-selection process – and missing those data-points, that fell off during this process (because they are not visible anymore). Everyday example of survivorship bias:One of the most interesting stories of statistical biases: falling cats. There was a study written in 1987 about cats falling out from buildings. It stated that the cats who fell from higher have less injuries than cats who fell from lower. Odd. They explained the phenomenon with the terminal velocity, which basically means that cats falling from higher than six stories are reaching their maximum velocity during the fall, so they start to relax, prepare to landing and that’s why they don’t injure themselves that hard. As ridiculous as it sounds, as mistaken this theory turned out to be. 20 years later, the Straight Dope newspaper pointed out to the fact, that those cats who are falling from higher than six stories might have died with a higher chance, thus people don’t take them to the veterinarian – so they were simply not registered and didn’t become the part of the study. And the cats that fell from higher, but survived were simply falling more luckily, that’s why they had less injuries. Survivorship bias – literally. (I feel sorry for the cats though.) Online analytics related example of survivorship bias:Reading case studies. Case studies are super useful to give you inspiration and ideas to your new projects. But remind yourself all the time, that only success stories are published! You will never hear about the stories, where one used the exact same methods, but failed. Not so long ago I’ve read a bunch of articles about exit intent pop-ups. Every article declared that exit intent pop-ups are great and brought +30%, +40%, +200% in number of newsletter subscriptions. In fact it works pretty decent on my website too… But let’s take a break for a moment. Does it mean that exit-intent popups will work for everyone? Isn’t it possible that those guys, who have tested exit-intent pop-ups and found that it actually hurts the user experience, the brand or the page load time, they have just simply didn’t write an article about this bad experience? Of course, it’s possible – nobody likes to write about unsuccessful experiment results… The point is: if you read a case study, think about it, research it and test it – and decide based on hard evidence if it’s the right solution for you or not. 4 MORE STATISTICAL BIAS TYPES AND SOME SUGGESTIONS TO AVOID THEM… This is just the beginning! 
Next week I’ll continue this article with 4 more statistical bias types – that every data scientist and analyst should know about. And on the week after, I’ll give you some practical suggestions, how to overcome these! Stick with me and subscribe to my weekly Newsletters (no spam, just 100% useful data content)! And if you have any comments, let me know below! Cheers, Tomi * August 21, 2017 * In Analyze the Data * AB test analytics bias data data science learn data science metrics qualitative research research statistical bias types statistics tomi mester ← Previous post2 COMMENTS 1. MANUELPB August 22, 2017Good article. Waiting for reading second part. Manuel, from Spain Reply * TOMI MESTER August 22, 2017Thanks Manuel! Coming next week! 😉 Tomi Reply * 2. LEAVE A REPLY CANCEL REPLY Comment Name * Email * Website Get free data articles weekly: We use cookies to ensure that we give you the best experience on our website. Ok","Be aware of the different statistical bias types is inevitable, if you are about to learn data science and analytics. Here are the most important ones.",Statistical Bias Types explained (with examples),Live,273 806,"Skip to main content IBM developerWorks / Developer Centers Sign In | Register Cloud Data Services * Services * How-Tos * Blog * Events * ConnectHOW TO ANALYZE YOUR PIPE RUNS WITH BUNYANDavid Taieb / August 11, 2015INTRODUCTIONIn this post, I’ll discuss how our Simple Data Pipe sample app uses the Bunyan Node.js logging framework to capture detailed logging information about a pipe run. Then I’ll show youhow to analyze the report using the Bunyan viewer tool.If you’ve explored our Simple Data Pipe tutorial on Bluemix, you know thatmetadata about your pipe runs is stored in Cloudant as JSON. Cloudant’s supportfor binary attachments within JSON lets you attach the logs from Bunyan rightalongside their associated JSON document, which you can then access for furtheranalysis.A WORD ABOUT BUNYANBunyan is a simple and fast JSON logging library for Node.js services. It can beconfigured to output the data to streams that can be stored anywhere. SimpleData Pipe uses this library to capture log information about a particular run,then attach the report to the pipe run document stored in the pipe_db database in your Cloudant account.This logging framework supports many log levels: trace , debug , info , warn , and error . As you’ll see, Bunyan also provides a CLI utility to pretty-print its output,with the ability to filter by logging group and level. See https://github.com/trentm/node-bunyan for more information.HOW TO LOCATE THE LOG FOR A PARTICULAR RUNHere’s the scenario: You attempted a pipe run and something went wrong. You nowneed to locate the log, download it from the pipe_db Cloudant database, and analyze it for troubleshooting. 1. Go to Bluemix and click on your pipe app instance. 2. Click on the pipes-cloudant-service box to open the Cloudant dashboard. Pipes Cloudant Service Box 3. Click the Launch button. 4. In the Cloudant dashboard, click on the pipe_db database. 5. In the menu on the left, click _design/application , then Views , then all_runs . 6. Locate your last run.On the left-hand side of the all_runs view, you can see that the Map function logic that defines this view indexes pipe run documents sorted in chronological order. So, the run document you’re looking for is the last one in the view. (You may need to page through the results a few times if you have performed a lot of runs.) 7. Click on the pencil icon to open the run document. 
You should be able to see the JSON metadata for the run. 8. In the toolbar above the document, click the View Attachments dropdown button and right-click on run.log . Then click on the Save Link As… option. This will download the file to a directory of your choice. 9. The next step will be to use the Bunyan CLI tool to analyze run.log.ANALYZE RUN.LOGIf you have not already done so, install Bunyan on your local machine usingthese simple steps:npm install bunyan -gNote: If you are on a Linux-based system like Mac OS X, use sudo .You can view the entire log in pretty-printed format using this command: bunyan Note: To quit long output, use q .You can also filter the log to only view errors using the following command:bunyan -l errorNote: You can use any log level you want, e.g., error , info , warn , etc.SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM PrivacyIBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.",How to use Bunyan to capture detailed logging of data migration runs through our Simple Data Pipes app.,How to analyze your pipe runs with Bunyan,Live,274 809,"DO MORE WITH COMPOSE POSTGRESQL USING ZAPIERShare on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published Apr 21, 2016Zapier is a service which allows you to create custom integrations among avariety of applications, including PostgreSQL. Below we'll look at a couple ofexamples for how you can do more with Compose PostgreSQL by integrating it withother tools via Zapier.We first introduced our readers to Zapier when we showed how to Zapier your data to MongoDB and later we followed that up with an article about how to send alerts from our platform to your application of choice. Since then, Zapier has added PostgreSQL to the long list of integrations they offer.NOTIFY ME WHEN SOMETHING CHANGESOne of the most common business requirements with databases is to receivenotifications when something important has changed. DBAs may want to receive anotification when a new table is created or when a new column is added to atable. Business users may want to know when there's a new row added to afavorite table or when the data from a custom query changes.Among other things, at Compose we like to keep track of how many people aretrying out our service with our free 30-day trial so we've written a ""zap"" (a Zapier integration widget) to notify us in our main Slack channel when the number of trial customers changes. Here's how:Once you sign up for a Zapier account , you'll see a button to ""Make a Zap"":TRIGGERSThe first step in making a zap is setting the trigger. 
You'll be asked whichapplication you want to start from. For this scenario, that's PostgreSQL:Next comes the trigger type we want to use. For us, that's a custom query, butyou can see the other options we mentioned (table, column and row):In a previous article we explained how to setup a Segment warehouse using Compose PostgreSQL . We're going to run our custom query against our Segment warehouse where we'retracking trial events.If you haven't already created a connection in Zapier to your ComposePostgreSQL, you'll be asked to do so:If you already have a connection, you'll be asked if you want to use an existingone or create a new one.For this scenario, because we are using the custom query trigger, we're asked toprovide our query:After that, we're asked to test the query by fetching a row. Once we've run thetest, we can view the row, re-test, or choose to just continue:ACTIONSNow that we have our trigger setup from PostgreSQL, we'll set the resultingaction - the notification.For the action, we're asked what application to use. At Compose, we're usingSlack:We're going to send a channel message, but there are several other options tochoose from:Then, if you already have a Slack account setup in Zapier, you'll be asked ifyou want to use it or create a new one. For our example, we'll create a newaccount:The next step is to fill out the Slack template. As you can see we're sending amessage to our ""general"" channel. In the message text, we're using the trialscount from our query with some additional text:There are several other options in the template including bot, image, link, andmention settings.At the end, similar to the trigger section, we'll then get to test our Slackmessage action. We can also re-test or just finish.Finally, we'll give our ""zap"" a name (in this case we are using Effie, thefictional tributes' escort from The Hunger Games ) and we'll turn it ""on"":Hey! There's a Slack notification!Now that the ""zap"" is turned on, it will run every 5-15 minutes (depending onthe billing plan you select) and will only perform the action (called a ""task""in the billing plan) when there has been a change in the data. Zapier lets you try this out with a 14 day free trial so you can determine what billing plan best fits your situation.FILTERS AND INTERMEDIARY ACTIONSWhile we didn't use any filters or intermediary actions between our ""zap""trigger and action, you can add them to enhance the precision or functionalityof your own zaps. A filter might check if the data met a certain criteria. Forexample, we could apply a filter to check if our trials are greater than 300before moving on to our final action that sends a channel message in Slack. Anintermediary action might be posting the data to another application, such assetting the data as a metric in a dashboard tool like Leftronic , before moving on the the final action. In this way, you can hit multiple appsor take multiple steps in the same app, with your trigger data. That's prettypowerful stuff!Now, that we've seen how to create notifications from changes in our ComposePostgreSQL database, let's look at a couple other use cases.COPY THE DATA SOMEWHERE ELSEIn our example above where we're using a custom query to generate a data row orwhen a new row is added to our table, rather than sending a notification, we maywant to copy that data to another app. For example, we may want to copy thatdata to Google sheets for our marketing team to have easy access to it forcreating reports. 
We could also use this option in a polyglot persistencescenario where we need the same data in a different database. In that case, wecould copy data from our Compose PostgreSQL database to our Compose MongoDB orRethinkDB databases (or vice versa!), or even to your corporate SQL Serverinstance.The steps for this use case are similar to the ones we demonstrated above,though each application will have its own specifics, of course. The great thingwith Zapier is that it's built to be intuitive and to guide you along for eachintegration type. Since we've run through one case with you here and a coupleothers in our previous articles, we know you've already got the hang of how toget data from PostgreSQL to other apps.Now let's look at our final use case for this article... having anotherapplication trigger the insertion of a data row into PostgreSQL.ADD DATA FROM ANOTHER APPAt Compose, we use Help Scout for keeping in contact with our customers and helping them resolve supportissues. Let's say that we want to tally the customer support conversations fromHelp Scout so that we can tie them directly to our accounts database inPostgreSQL.So, we'll make a new ""zap"", choose Help Scout as the trigger application, andthen choose ""New conversation"" as the trigger:You'll be asked to setup the connection to Help Scout app via API key (which youcan generate in the settings for your Help Scout profile) if you don't have onecreated already.Next, we'll select the Help Scout mailbox we want and a status if that'sapplicable:We'll then test the request and continue on.Next, we'll move on to the action... Add data to PostgreSQL.We'll choose to add a new row for this example:Since our PostgreSQL connection already exists in Zapier, we'll choose to useit, though as we mentioned, you could add a new one if you need to.Next, we'll select the table in PostgreSQL and set how the fields from HelpScout map to the fields in our table. For this example, we just have a simpletable that will collect the timestamp at which the conversation was created andthe customer email:We then test our data row insertion into PostgreSQL and finish our ""zap"" bygiving it a name and turning it on.Now, what we can do is create a report from data in PostgreSQL that aggregates acustomer's conversations from Help Scout and joins that to the account recordthat already exists in our PostgreSQL database. Or we can create a query to tellus the most frequent days and times that conversations are created to make surewe have good coverage in support.This is just a simple example, but think of how powerful this use case can be.With Zapier, you can add data to your PostgreSQL database from otherapplications so that you can easily create reports and run analyses frommultiple sources in one convenient location - Compose PostgreSQL!WRAPPING UPZapier is a powerful tool that will help you get more from your ComposePostgreSQL database, either by using it to trigger data or notifications toother apps or by using it to generate new data rows in the database based ondata or events from other apps. Compose PostgreSQL, MongoDB and RethinkDB areall currently supported by Zapier as well as more than 500 other applicationsavailable for integration. If you don't already have a Compose account, signup to get started with PostgreSQL today.Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Lisa Smith - keepin' it simple. Love this article? Head over to Lisa Smith’s author page and keep reading. 
Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Database as a Service Support System Status Support Enhanced Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ Enterprise Deployments AWS DigitalOcean SoftLayer© 2016 Compose","Zapier is a service which allows you to create custom integrations among a variety of applications, including PostgreSQL. Below we'll look at a couple of examples for how you can do more with Compose PostgreSQL by integrating it with other tools via Zapier.",Do More with Compose PostgreSQL using Zapier,Live,275 818,,"Love to work in Microsoft Excel? Watch how to connect to IBM dashDB as the data source for Excel, and how to import tables into a spreadsheet. ",Integrate dashDB with Excel,Live,276 819,"Skip navigation Sign in SearchLoading... Close Yeah, keep it Undo CloseTHIS VIDEO IS UNAVAILABLE. WATCH QUEUE QUEUE Watch Queue Queue * Remove all * Disconnect The next video is starting stop 1. Loading... Watch Queue Queue __count__/__total__ Find out why CloseDATA SCIENCE EXPERIENCE: WORK WITH DATA CONNECTIONS developerWorks TVLoading... Unsubscribe from developerWorks TV? Cancel UnsubscribeWorking... Subscribe Subscribed Unsubscribe 17KLoading... Loading... Working... Add toWANT TO WATCH THIS AGAIN LATER? Sign in to add this video to a playlist. Sign in Share More * ReportNEED TO REPORT THE VIDEO? Sign in to report inappropriate content. Sign in * Transcript * Statistics * Add translations 2 views 1LIKE THIS VIDEO? Sign in to make your opinion count. Sign in 2 0DON'T LIKE THIS VIDEO? Sign in to make your opinion count. Sign in 1Loading... Loading... TRANSCRIPT The interactive transcript could not be loaded.Loading... Loading... Rating is available when the video has been rented. This feature is not available right now. Please try again later. Published on Oct 3, 2017Find more videos in the Data Science Experience Learning Center at http://ibm.biz/dsx-learning * CATEGORY * Science & Technology * LICENSE * Standard YouTube License Show more Show lessLoading... Autoplay When autoplay is enabled, a suggested video will automatically play next.UP NEXT * Tanmay Bakshi on building AskTanmay - Duration: 22:59. developerWorks TV 193,680 views 22:59 -------------------------------------------------------------------------------- * Big data and dangerous ideas | Daniel Hulme | TEDxUCL - Duration: 14:40. TEDx Talks 36,478 views 14:40 * Cabling a SoftLayer Data Center Server Rack - Duration: 4:09. IBM Bluemix 1,170,843 views 4:09 * Data Science Experience: Build SQL queries with Apache Spark - Duration: 3:29. developerWorks TV 2 views * New 3:29 * Tableau for Data Scientists - Duration: 35:23. Brent Tabl 138 views 35:23 * Data science and our magical mind: Scott Mongeau at TEDxRSM - Duration: 16:33. TEDx Talks 18,776 views 16:33 * JavaOne: Microservice hands-on - Duration: 5:22. developerWorks TV No views * New 5:22 * IBM Bluemix Data Connect - Self Service Data Preparation and Integration Demo - Duration: 16:11. carlo appugliese 1,705 views 16:11 * Data hacking - data science for entrepreneurs | Kevin Novak | TEDxWakeForestU - Duration: 17:11. TEDx Talks 18,901 views 17:11 * Data Science Hands on with Open source Tools - WHAT IS DATA SCIENTIST WORKBENCH? - Duration: 3:47. Cognitive Class 5,687 views 3:47 * Data Science Hands on with Open source Tools - Creating & Uploading Workflows - Duration: 4:42. 
Cognitive Class 2,069 views 4:42 * HURRICANE MARIA RECORD RAIN - FLOODING - Cosmic Ray Connection and the Grand Solar Minimum - Duration: 8:44. Oppenheimer Ranch Project 1,763 views 8:44 * My Journey to Data Scientist - Duration: 3:13. Story by Data 1,952 views 3:13 * Data Science Hands on with Open source Tools - What are Jupyter notebooks - Duration: 2:22. Cognitive Class 4,362 views 2:22 * IBM Watson Machine Learning: Build a Predictive Analytic Model - Duration: 4:06. developerWorks TV 21 views * New 4:06 * JavaOne: The excitement so far - Duration: 5:04. developerWorks TV 1 view * New 5:04 * IBM Big SQL: Analyze HDFS data with IBM Cognos Analytics - Duration: 6:54. developerWorks TV No views * New 6:54 * JavaOne: Optimize enterprise Java with Microprofile 1.2 - Duration: 5:49. developerWorks TV No views * New 5:49 * IBM Analytics Engine Overview - Duration: 7:21. developerWorks TV 7 views * New 7:21 * JavaOne: Meet a new Java face at developerWorks - Duration: 2:30. developerWorks TV 1 view * New 2:30 * Language: English * Content location: United States * Restricted Mode: Off History HelpLoading... Loading... Loading... * About * Press * Copyright * Creators * Advertise * Developers * +YouTube * Terms * Privacy * Policy & Safety * Send feedback * Test new features * Loading... Working... Sign in to add this to Watch LaterADD TO Loading playlists...",This video shows you how to set up connections to both Bluemix and external sources.,Work with Data Connections in DSX,Live,277 820,"Homepage Follow Sign in / Sign up Homepage * Home * Data Science Experience * * Watson Data Platform * Jorge Castañón Blocked Unblock Follow Following applied mathematician and art lover | opinions are my own Jun 29, 2016 -------------------------------------------------------------------------------- DEEP LEARNING TRENDS AND AN EXAMPLE The Spark Summit 2016 took place on June 6–8 in San Francisco and it was a sold out event with more than 2,500 attendees. Not surprisingly, deep learning (DL) and artificial intelligence (AI) were the main dishes of the conference. On day one, most of the keynotes were on how DL and AI are making the world better. Don’t you think it’s amazing that you can teach a computer to distinguish between an image of a cat from an image of a dog? I do! And the mentioned example is nothing compared to others fantastic examples that were presented at the Summit. Based on google searches, starting around 2014, both terms Apache Spark and Deep Learning have had a dramatic increase. Jeff Dean, head of Google’s brain team, talked about how DL is used to verbally describe an image. Imagine a blind person using an app to understand an image without the help of other person! Jeff also talked about other use-cases where DL is useful like speech recognition and email smart reply, among others. Andrew Ng, chief scientist at Baidu and co-founder of Coursera , compared AI models with rockets: artificial neural networks to their engine and data to its fuel. At Baidu, DL and AI are being applied to train models for autonomous driving, fraud and malware detection, among other use-cases. Neural networks (NN) need more data than traditional algorithms, especially deep neural networks. The gain of NN algorithms trained with large amounts of data is in the quality of your predictions at a cost of more computational power (therefore the popularity of GPU’s used for training NN’s). Find the slide shown and more information about all the very interesting talks at the Spark Summit here . 
Hopefully I have convinced you that DL and AI is quiet something to look at. This is why I build a notebook on Data Science Experience to run a very well-known and simple DL example for classifying handwritten digits. Please check my notebook here . The best way to contact me for questions, feedback or just to say hi is @castanan . -------------------------------------------------------------------------------- Originally published at datascience.ibm.com on June 29, 2016. * Artificial Intelligence * Data Science Experience * Dsx A single golf clap? Or a long standing ovation?By clapping more or less, you can signal to us which stories really stand out. Blocked Unblock Follow FollowingJORGE CASTAÑÓN applied mathematician and art lover | opinions are my own FollowIBM WATSON DATA PLATFORM Build smarter applications and quickly visualize, share, and gain insights * * * * Never miss a story from IBM Watson Data Platform , when you sign up for Medium. Learn more Never miss a story from IBM Watson Data Platform Get updates Get updates","The Spark Summit 2016 took place on June 6–8 in San Francisco and it was a sold out event with more than 2,500 attendees. Not surprisingly, deep learning (DL) and artificial intelligence (AI) were…",Deep learning trends and an example,Live,278 825,"Compose The Compose logo Articles Sign in Free 30-day trialHOW TO TALK RAW REDIS Published Feb 22, 2017 redis Development How to talk raw RedisFind out how to talk to a Redis database with nothing more than echo and the netcat command and get a deeper understanding of why developers love Redis. Redis has, as we've shown in the past, many many drivers . One of the reasons for situation is that, by design, Redis has a very simple protocol, RESP, for communicating with the server. Building on that simple protocol has allowed people to create these many drivers and their various levels of abstraction or idiomatic appropriateness. But let's talk about getting down to basics here; sometimes resource constraints demand you create the smallest possible connection code. OUR FIRST COMMAND For this example, we're going to get some status information from the Redis server; the INFO command returns lots of useful information so we will use that. Now, to send strings with Redis's RESP protocol, you need to say how long the string is. That's done by preceding your string with a $ and the number of characters in the string so that's 4 characters. After the number and after the string should be the carriage return and newline, \r\n . Let's build our string to send: $4\r\nINFO\r\n Redis's RESP also wants to know how many strings are in a command. It has a ""bulk-strings"" indicator which is like the $ operator, except uses a * and the number following it is the number of strings in the command. For this command, that's... 1. So we need to precede the string with *1 . *1\r\n\$4\r\nINFO\r\n We can use the nc - net cat - command to send this string to our server. That server, for this example is at sl-eu-lon-2-portal.1.dblayer.com and on port 10030. If we echo our string and pipe it into nc we should get information: $ echo ""*1\r\n$4\r\nINFO\r\n"" | nc sl-eu-lon-2-portal.1.dblayer.com 10030 -ERR Protocol error: expected '$', got ' ' $ The problem here is the old classic shell thing of unexpected expansion. The shell sees that $ and wants to expand it from an environment variable $4 which is, of course, blank. We need to escape that and try again. 
$ echo ""*1\r\n\$4\r\nINFO\r\n"" | nc sl-eu-lon-2-portal.1.dblayer.com 10030 -NOAUTH Authentication required. $ TIME TO LOG ON Because who would put a completely open Redis server up on the internet. The Compose Redis servers start up with authentication on and a 16 character password set. If you are doing this against a server that you own and it worked, check now that that server isn't externally visible. Other people do not value your data. Anyway, back to getting authenicated. We need to construct another command, the AUTH command and follow it with our password. This time, there's two strings in the command, the AUTH command itself and the password, let's say it's FLIBBERTIGIBBETS for now. That gives us *2\r\n\$4\r\nAUTH\r\n\$16\r\nFLIBBERTIGIBBETS\r\n which we can put into the front of out command now: $ echo ""*2\r\n\$4\r\nAUTH\r\n\$16\r\nFLIBBERTIGIBBETS\r\n*1\r\n\$4\r\nINFO\r\n"" | nc sl-eu-lon-2-portal.1.dblayer.com 10030 +OK $2306 # Server redis_version:3.2.6 redis_git_sha1:00000000 redis_git_dirty:0 redis_build_id:dafadbf6141a77d5 redis_mode:standalone os:Linux 3.19.0-39-generic x86_64 arch_bits:64 multiplexing_api:epoll gcc_version:4.8.4 process_id:36 run_id:275d6af5f6ccf9be4efde1dbcb8223483386cf42 tcp_port:6379 ... $ That goes on for a little while... There are 2,306 characters of information in total. We know that because that's the first thing Redis told us. In the same way, as we tell it how long strings are, it does the same back, so the second line, $2306 is telling us how many characters are coming back. That's the response to the INFO command. Immediately preceding that is +OK , the response to the AUTH command. The + is the signal that this is a simple non-binary safe string; a minimal OK. LET'S MAKE IT TIDY Anyway now we can pipe those results to any command we want for post-processing. Add a | tail -n +3 to chop off the Redis RESP responses and we have a clean output, just like entering INFO at the redis-cli command line. We're good people and good people don't leave passwords in shell scripts. Let's pop the password into an environment variable... export REDISAUTH=""FLIBBERTIGIBBETS"" And change the command to use that. This time we want that shell expansion to happen. echo ""*2\r\n\$4\r\nAUTH\r\n\$16\r\n$REDISAUTH\r\n*1\r\n\$4\r\nINFO\r\n"" | nc sl-eu-lon-2-portal.1.dblayer.com 10030 | tail -n +3 Now the command is sharable without giving away your password. There's one last thing to do for this example. That INFO data is quite a lot to work through and we only really, say, want the STATS section. That's not a problem, we just need to add that to the INFO command, first by bumping the string count at the start of the command to 2 and then appending \$5\r\nSTATS\r\n to the end. $ echo ""*2\r\n\$4\r\nAUTH\r\n\$16\r\n$REDISAUTH\r\n*2\r\n\$4\r\nINFO\r\n\$5\r\nSTATS\r\n"" | nc sl-eu-lon-2-portal.1.dblayer.com 10030 | tail -n +3 # Stats total_connections_received:1700495 total_commands_processed:40375802 instantaneous_ops_per_sec:7 total_net_input_bytes:1984826397 total_net_output_bytes:2497942423646 instantaneous_input_kbps:0.41 instantaneous_output_kbps:1.10 rejected_connections:0 sync_full:5 sync_partial_ok:0 sync_partial_err:0 expired_keys:0 evicted_keys:0 keyspace_hits:194 keyspace_misses:6 pubsub_channels:1 pubsub_patterns:0 latest_fork_usec:469 migrate_cached_sockets:0 $ Now we just have our statistics. And it's still one shell command. WHAT THIS GETS US You'll note that the Redis RESP protocol is remarkably simple. 
You can find out more about it on the RESP specification page . There are some other response types ( - for an error, which we saw when we got the NOAUTH message and : for integers) and some other rules to take note of but it's also pretty simple to code for. If you have a new language on your hands and no Redis driver, it is good to know that the protocol is so simple and readily implementable in even the most constrained of languages. As long as you can open a TCP socket to a port and read/write to it, you are good to go. This should also help explain the types of Redis drivers that are around. The minimalist drivers basically provide enough to make sending commands to and receiving data from Redis; the user of the driver sends the strings for the commands. Also, it's got its uses in the Internet of Things. In a future article, we'll be looking at that when we add Redis stats gathering (among other things) to a very resource-constrained device. Until then, have fun going low with Redis. -------------------------------------------------------------------------------- If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com . We're happy to hear from you. attribution Patrick Hendry Dj Walker-Morgan is Compose's resident Content Curator, and has been both a developer and writer since Apples came in II flavors and Commodores had Pets. Love this article? Head over to Dj Walker-Morgan ’s author page and keep reading.RELATED ARTICLES Feb 21, 2017WHY IT CONSULTING AND DEVELOPER SERVICES COMPANIES LOVE COMPOSE One of the great constants of software consulting is this: You need reliable, stable, and repeatable databases and database s… Arick Disilva Feb 17, 2017NEWSBITS: REDIS, ETCD AND ELASTICSEARCH UPDATES, GO 1.8, GITHUB GUIDES AND CHATOPS AND MORE NewsBits for the week ending 17th February - Redis gets a critical update, etcd's latest release, Elasticsearch gets a bump,… Dj Walker-Morgan Feb 10, 2017NEWSBITS: RETHINKDB LIVES, REDIS AND POSTGRESQL FUTURES, FOSDEM, RUST AND WUZZ NewsBits for the week ending 10th February - RethinkDB has a new home, Redis's future is being mapped out, PostgreSQL 10's fe… Dj Walker-Morgan Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Customer Stories Compose Webinars Support System Status Support & Docs Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ ScyllaDB MySQL Deployments AWS SoftLayer Google Cloud Services Enterprise Add-ons © 2017 Compose, an IBM Company",Find out how to talk to a Redis database with nothing more than echo and the netcat command and get a deeper understanding of why developers love Redis.,How to talk raw Redis,Live,279 826,"Skip to main content IBM developerWorks / Developer Centers Sign In | Register Cloud Data Services * Services * How-Tos * Blog * Events * Connect FULL TEXT SEARCH FROM WITHIN APACHE COUCHDB™ Mike Broberg / October 20, 2015A few months back, IBM Cloudant open-sourced the repos that power our integration with the Apache Lucene™ text search engine library. See this excellent blog from Robert Newson, who outlines the projects Clouseau and Dreyfus and explains how they interact with Cloudant’s CouchDB-based system. Use Lucene with the current release of CouchDB y’all. 
The Lucene Search integration will become part of the forthcoming CouchDB 2 release, but if you can’t wait, our own Robert Kowalski published instructions on how to recompile the current 1.6.1 release of CouchDB to use the new search features . See his blog at https://cloudant.com/blog/enable-full-text-search-in-apache-couchdb/ for more. In addition to their work at IBM Cloudant, both Roberts are deeply involved in Apache CouchDB as members of its Project Management Committee. A big thank you to both for their work and for making CouchDB an awesome place to store JSON data :D © “Apache”, “CouchDB”, “Lucene”, “Apache CouchDB”, “Apache Lucene”, and the CouchDB and Lucene logos are trademarks or registered trademarks of The Apache Software Foundation. All other brands and trademarks are the property of their respective owners. SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Please enable JavaScript to view the comments powered by Disqus. blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Geospatial * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM Privacy IBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.",Learn how to connect the 1.6.1 release of CouchDB to Cloudant's recently open-sourced Lucene integration.,Using Lucene search from within CouchDB,Live,280 831,"Compose The Compose logo Articles Sign in Free 30-day trialANALYZING PET NAME TRENDS WITH POSTGRESQL'S CROSSTABVIEW Published Jul 13, 2017 postgresql crosstabviews Analyzing Pet Name Trends with PostgreSQL's crosstabviewPostgreSQL 9.6 comes with a number of updates and new features to explore. One very useful addition is the \crosstabview command, which gives you the power to rearrange how your data is viewed without the difficulty of writing complex SQL queries. Since the release of PostgreSQL 9.6.2 on Compose, we've been playing with some of the new additions to the database. One addition we found interesting and quite useful is the new psql meta-command \crosstabview , which was released with PostgreSQL 9.6. This command allows query results to be shown in a representation, similar to a spreadsheet pivot table, without needing to write complex SQL queries. Here, we'll look at how it works and show you some of the use cases where it might be beneficial to use. The dataset we'll use is the Current Pet Licenses for the City of Tacoma and Fircrest for 2017 , which is a CSV file that contains a list of 15,555 names of cats and dogs in the two cities. To follow along, download the dataset from the link and let's look at how the \crosstabview command works. IMPORTING THE DATASET AND QUERYING PET NAMES After downloading the dataset, we created a database pets and a table names , then we imported the CSV data. 
CREATE TABLE names ( name TEXT, animaltype TEXT, primarybreed TEXT, tpdsector INT, latlon TEXT, animalcount INT ); \COPY names (name, animaltype, primarybreed, tpdsector, latlon, animalcount) FROM '/Downloads/Current_Pet_License-City_of_Tacoma___Fircrest.csv' CSV HEADER; Now that the pet names have been inserted, let's look for the names that both cats and dogs share. A simple query using count and a GROUP BY clause will do the trick. SELECT name, animaltype, count(name) FROM names GROUP BY name, animaltype ORDER BY 1; A sample of the results of that query is below. As you can see, some names are shared between cats and dogs (e.g. ""ABBY""). However, since the names are divided between CAT and DOG , the names are grouped accordingly and we don't have one row dedicated to a single name. name | animaltype | count ----------------------+------------+------- 2P2 | DOG | 1 A BARKSDALE | DOG | 1 A509966 | CAT | 1 AARON | DOG | 1 AB | DOG | 1 AB ""ABBY"" | DOG | 1 ABBBY | CAT | 1 ABBEY | DOG | 5 ABBI | DOG | 2 ABBIE | DOG | 10 ABBIGAIL | CAT | 1 ABBOTT | CAT | 1 ABBY | DOG | 40 ABBY | CAT | 12 ... Trying to look at every row to find each cat and dog with identical names will be tedious, especially if the dataset is much larger than this. One way that we might overcome the problem is to design a new query that would put the count of cats and dogs in their own column. SELECT name, count(CASE WHEN animaltype='CAT' THEN 1 END) AS CAT, count(CASE WHEN animaltype='DOG' THEN 1 END) AS DOG FROM names GROUP BY name ORDER BY 1; which produces ... name | cat | dog ----------------------+-----+----- 2P2 | 0 | 1 A BARKSDALE | 0 | 1 A509966 | 1 | 0 AARON | 0 | 1 AB | 0 | 1 AB ""ABBY"" | 0 | 1 ABBBY | 1 | 0 ABBEY | 0 | 5 ABBI | 0 | 2 ABBIE | 0 | 10 ABBIGAIL | 1 | 0 ABBOTT | 1 | 0 ABBY | 12 | 40 ... But creating an entirely new query to reorganize our data might be overkill, especially if you only want to rearrange the columns. That's where PostgreSQL's \crosstabview will help. QUERY WITH \CROSSTABVIEW The first query we ran grouped together and counted all the cats and dogs having names with identical spellings and placed them into separate rows. \crosstabview can transform the data automatically by placing CAT and DOG in separate, horizontal columns, merging together the pet names in the vertical column, and using the count values to fill in the grid where cells are shared between the horizontal and vertical headers. All that's required for \crosstabview to work is that you have at least three columns that it can select data from. It does this by finding the distinct values within the query's results and uses them as horizontal and vertical headers. The data shared between the header values are then projected into the grid of cells. To see it in action, all you have to do is run \crosstabview after your SQL query. SELECT name, animaltype, count(name) FROM names GROUP BY name, animaltype ORDER BY 1 \crosstabview Once the \crosstabview command is executed, it sends the query input buffer to the server then shows the results of that query in a crosstab grid. That means crosstabview will only use the last query executed to populate the crosstab grid. If \crosstabview is appended to the SQL query like in the query above, don't use the semicolon after \crosstabview , otherwise, you'll get an error: Invalid command \crosstabview; . That's because \crosstabview works similarly to ; at the end of an SQL query. 
Alternatively, if you execute the query first with a semicolon ; then, afterward, execute \crosstabview , it will give you the same results because it uses the query buffer. When \crosstabview is executed in the above query, we'll get the following table with individual columns for DOG and CAT , which are the distinct values taken from the animalType column. As you can see, each name is grouped together like we want and the count column values are then used to fill in the table grid. We got a similar result using the second query we wrote above but using \crosstabview allowed users to use the query buffer and saved us from building and executing a new query that produces a similar result. name | DOG | CAT ----------------------+-----+----- 2P2 | 1 | A BARKSDALE | 1 | A509966 | | 1 AARON | 1 | AB | 1 | AB ""ABBY"" | 1 | ABBBY | | 1 ABBEY | 5 | ABBI | 2 | ABBIE | 10 | ABBIGAIL | | 1 ABBOTT | | 1 ABBY | 40 | 12 ... REARRANGING TABLES WITH \CROSSTABVIEW Behind the scenes, PostgreSQL's \crosstabview will determine how to set up your table. However, if you want to rearrange how your data is viewed, PostgreSQL gives you that option, too. If you want to tell \crosstabview to rearrange the table, for example, you may want to flip horizontal and vertical headers by placing DOG and CAT vertically and name horizontally, you can do that by specifying the vertical and horizontal headers, respectively, like: \crosstabview animaltype name This tells \crosstabview to place type as the vertical header and name as the horizontal header. You need to put a space between the column names. For this example, the count column will automatically be used as the data that fills in the grid. If we wanted to specify the count column as the data that \crosstabview will use, we'd place count as the third argument. \crosstabview animaltype name count However, this is not really necessary here since PostgreSQL will automatically deduce that count is the data shared by the values in the vertical and horizontal headers. There are some limitations if you decide to specify the order of the headers. For example, running the query above with the name column as the horizontal header will give you: \crosstabview: maximum number of columns (1600) exceeded This error occurs because we've put all our names in the horizontal header, which PostgreSQL has limited to 1600 columns. Therefore, we can run the query again with \crosstabview animaltype name , but limit the query to get the first ten results, which would return something like: animaltype | 2P2 | A BARKSDALE | A509966 | AARON | AB | AB ""ABBY"" | ABBBY | ABBEY | ABBI | ABBIE ------------+-----+-------------+---------+-------+----+-----------+-------+-------+------+------- DOG | 1 | 1 | | 1 | 1 | 1 | | 5 | 2 | 10 CAT | | | 1 | | | | 1 | | | TRENDING PET NAMES Looking beyond getting pet names and animal types, we could use \crosstabview to find out what breed of dogs, for instance, tend to have certain names and whether there is a correlation between animal breeds and pet names that pet owners prefer. To do that, we could construct a query that analyzes the breeds of DOG and the names associated with them. SELECT primarybreed, name, count(name) FROM names WHERE animaltype = 'DOG' GROUP BY primarybreed, name ORDER BY 3 DESC; This query will give us a list of dog breeds, the names of dogs associated with a breed, and the number of dogs that have a specific name that is a certain breed. 
primarybreed | name | count -----------------+----------------------+------- LABRADOR RETR | BELLA | 23 LABRADOR RETR | MAX | 20 LABRADOR RETR | SADIE | 15 LABRADOR RETR | CHARLIE | 14 LABRADOR RETR | DAISY | 14 LABRADOR RETR | MAGGIE | 13 LABRADOR RETR | RILEY | 13 CHIHUAHUA SH | BUDDY | 12 CHIHUAHUA SH | CHICO | 12 LABRADOR RETR | MOLLY | 12 CHIHUAHUA SH | BELLA | 12 LABRADOR RETR | BEAR | 10 LABRADOR RETR | LUCY | 10 GOLDEN RETR | CHARLIE | 9 LABRADOR RETR | BAILEY | 9 LABRADOR RETR | STELLA | 9 LABRADOR RETR | COCO | 8 GERM SHEPHERD | MAGGIE | 8 LABRADOR RETR | DUKE | 8 LABRADOR RETR | LUNA | 8 GERM SHEPHERD | MAX | 8 ... From the results, it seems that there are a lot of Labrador Retrievers named Bella, but we also have a high number of short hair Chihuahua's with the same name. Bella is not the only name that is shared between breeds, but looking at the entire list of all the occurrences of Bella, or any dog for that matter is not efficient. In fact, it's the same problem that we ran into in the first query where we have a repetition of names on separate rows, but this time it's because the names are listed with different breeds. The problem with this query is that if we decided to run \crosstabview , we'd exceed the number of columns allowed since the name column would be placed in the horizontal header. We could try to go around this by specifying that we want name in the vertical column and primarybreed in the horizontal column like \crosstabview name primarybreed , but we'd get a table that is extremely difficult to read. In order to overcome this, we might want to select the top 10 names of dogs and then use those names to see what breeds tend to have those names. To do that, we'll use the following query, which is a modified version of the first query we ran in the article that selects only the animaltype = 'DOG' and is ordered in descending order according to the animal name : SELECT name, animaltype, count(name) FROM names WHERE animaltype = 'DOG' GROUP BY name, animaltype ORDER BY 3 DESC LIMIT 10 \crosstabview This gives us the following table with the top ten dog names: name | DOG ---------+----- BELLA | 117 LUCY | 103 BUDDY | 102 MAX | 92 DAISY | 87 CHARLIE | 77 MOLLY | 77 SADIE | 64 JACK | 60 MAGGIE | 56 Now that we know the top ten dog names, we can create a second query that narrows down the search and selects the number of dogs with those top ten names and the breeds that they belong to. SELECT primarybreed, name, count(primarybreed) FROM names WHERE animaltype = 'DOG' AND name LIKE ANY('{BELLA,LUCY,BUDDY,MAX,DAISY,CHARLIE,MOLLY,SADIE,JACK,MAGGIE}') GROUP BY primarybreed, name ORDER BY 3 DESC; This will return a table that looks something like this: primarybreed | name | count -----------------+---------+------- LABRADOR RETR | BELLA | 23 LABRADOR RETR | MAX | 20 LABRADOR RETR | SADIE | 15 LABRADOR RETR | CHARLIE | 14 LABRADOR RETR | DAISY | 14 LABRADOR RETR | MAGGIE | 13 CHIHUAHUA SH | BELLA | 12 CHIHUAHUA SH | BUDDY | 12 ... 
Now, using \crosstabview the results will be arranged according to the name of the dogs in the horizontal column and the primarybreed in the vertical column like: primarybreed | BELLA | MAX | SADIE | CHARLIE | DAISY | MAGGIE | BUDDY | MOLLY | LUCY | JACK -----------------+-------+-----+-------+---------+-------+--------+-------+-------+------+------ LABRADOR RETR | 23 | 20 | 15 | 14 | 14 | 13 | 8 | 12 | 10 | 7 CHIHUAHUA SH | 12 | 4 | 1 | 2 | 6 | 5 | 12 | 1 | 6 | 7 GOLDEN RETR | 4 | 4 | 5 | 9 | 6 | 4 | 5 | 6 | 2 | 3 GERM SHEPHERD | 4 | 8 | 4 | 4 | 3 | 8 | 1 | 3 | 3 | 2 POMERANIAN | 4 | | 1 | | | | 3 | | 5 | 1 SHIH TZU | 5 | 4 | 2 | 3 | 3 | 1 | 5 | 4 | 2 | 2 PIT BULL | 4 | 5 | 4 | 2 | 5 | | 4 | 3 | 4 | 1 AUST SHEPHERD | 1 | 4 | 2 | 3 | 1 | 2 | 5 | 1 | 1 | DACHSHUND | 2 | 2 | 1 | 3 | 2 | 1 | 5 | 4 | 5 | 2 ... Using the first table to get the top ten dog names, we can already assume the order of the most popular dogs. However, the other question that we wanted to answer is whether there are particular breeds of dogs that have these top ten names. Instead of creating another query for this, we simply used \crosstabview to organize the name of dogs and the breeds in horizontal and vertical headers. The count was then dispersed throughout the grid forming what we have above. From the data that's presented, we can determine that not only is Bella the most popular name, but it's the most popular name for Labrador Retrievers. At the same time, it's a pretty popular name for Chihuahuas, too. The table also tells us the most popular breed of dog for among the top ten names are Labrador Retrievers overwhelmingly, which might conclude that the inhabitants of Fircrest and Tacoma like their so-called family dogs. Other interesting questions that might be answered with further data is whether pet owners prefer female over male dogs, and what names and breeds are preferred for males and females. According to the limited data presented here, it appears that female dogs are preferred over males just by looking at the top ten names. However, to make that claim we'd have to categorize the gender of all the pets according to their name, which may be easy to do with Pippy Long Stockings, Clarice, and Han Solo, but a little more difficult with Fluffy, Snickerdoodle, and Boo Boo. There is a lot more that we could conclude from these results, but \crosstabview has provided, nonetheless, a way to easily take rows with figures and get meaningful result that would otherwise appear jumbled across a number of rows that we'd have to sift through, or create more complex queries to get similar results. SUMMING UP The c\rosstabview command only works in the psql shell. It's not a command that you can use in your application; for that, you will have to write a query that will produce the table structure you need, or use the crosstab function, which is included in the tablefunc extension. This extension is easy to add in Compose PostgreSQL by selecting the extension from the Compose console. However, if you simply want another view of your data from within the psql shell, then \crosstabview is a fantastic alternative that will make your life easier when trying to disect complicated datasets and the best part is that it comes out of the box with PostgreSQL 9.6. -------------------------------------------------------------------------------- If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com . We're happy to hear from you. 
attribution Ricardo Gomez Angel Abdullah Alger is a former University lecturer who likes to dig into code, show people how to use and abuse technology, talk about GIS, and fish when the conditions are right. Coffee is in his DNA. Love this article? Head over to Abdullah Alger ’s author page and keep reading.RELATED ARTICLES Jul 12, 2017INTEGRATION TESTING AGAINST REAL DATABASES Integration testing can be challenging, and adding a database to the mix makes it even more so. In this Write Stuff contribu… Guest Author Jul 7, 2017NEWSBITS: ELASTICSEARCH UPDATE ADDS IP RANGES AND MORE These are the NewsBits from Compose for the week ending 7th July: Elasticsearch and Kibana updated A release date for Redis 4… Dj Walker-Morgan Jul 3, 2017DATALAYER EXPOSED: JOSHUA DRAKE & POSTGRESQL: THE CENTER OF YOUR DATA UNIVERSE Start your Monday on a high note and catch up on videos from this year's DataLayer Conference. This week we're highlighting J… Thom Crowe Products Databases Pricing Add-Ons Datacenters Enterprise Learn Why Compose Articles Write Stuff Customer Stories Webinars Company About Privacy Policy Terms of Service Support Support Contact Us Documentation System Status Security © 2017 Compose, an IBM Company","Let's explore the `\crosstabview` command, which gives you the power to rearrange how your data is viewed without the difficulty of writing complex SQL queries.",Analyzing Pet Name Trends with PostgreSQL's crosstabview,Live,281 835,"Enterprise Pricing Articles Sign in Free 30-Day TrialDRONE DEPLOY CONQUERS THE DATA LAYER Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published Jan 25, 2017Compose has quite a few unique customers. One of the more unique that we've visited with is DroneDeploy , a company that automates drone flight and lets users explore map data from within an app. Nick Pilkington, DroneDeploy's CTO, tells us that they are, ""taking the existing drone hardware and combining it with a very powerful piece of software to make that drone into a useful tool... something that's repeatable, something that's reliable, something that's safe, and something that provides a huge amount of value."" Pretty cool, huh? So, we visited with Nick to talk about their mapping, app and how they're using Compose. Check out the video to see how Drone Deploy conquered their data layer. -------------------------------------------------------------------------------- If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com . We're happy to hear from you. Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Thom Crowe is a marketing and community guy at Compose, who enjoys long walks on the beach, reading, spending time with his wife and daughter and tinkering. Love this article? Head over to Thom Crowe’s author page and keep reading. Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Customer Stories Compose Webinars Support System Status Support Documentation Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ ScyllaDB MySQL Enterprise Add-ons * Deployments AWS SoftLayer Google Cloud * Subscribe Join 75,000 other devs on our weekly database newsletter. 
© 2017 Compose","We visited Nick Pilkington, DroneDeploy's CTO, to talk about their mapping, app and how they're using Compose.",Customer: Drone Deploy Conquers the Data Layer,Live,282 838,"Skip to main content IBM developerWorks / Developer Centers Sign In | Register * Projects * Blogs * About * Contribute * OpenTech * Tutorials * Events * Videos Search Brunel Visualization | More Brunel Visualization posts < Previous / Next >TWELVE WAYS TO COLOR A MAP OF AFRICA USING BRUNEL Graham Wills / Follow @GrahamWills / November 23, 2015The two main new features of Brunel 0.8 are an enhanced UI for building and a thorough re-working of our code for mapping data to color. This post is going to talk about the latter — with a lot of examples! The data set we are using is from http://opengeocode.org . We took a subset of the countries and data columns ( CSV data ) for this exercise. These examples are using some prototype code for geographic maps that we are going to introduce into a later version of Brunel (probably v1.0, slated for January), but maps looks so nice, we wanted to use them for this article. Please do not depend on the currently functionality — consider this an “advance preview” and highly subject to change. Because there are a lot of maps, these are not live versions, but static images — click on them to open up a Brunel editor window where you can see it live and make changes. The Brunel language reference describes the improvements to the color command in detail. Here we just show examples! CATEGORICAL COLORS The above two images are created by the following Brunel: * map(‘africa’) x(name) color(language) label(iso) tooltip(#all) style(‘text-shadow:none}’) * map(‘africa’) x(name) color(language:[white, nominal]) label(iso) tooltip(#all) style(‘text-shadow:none}’) For all our examples, the only changes are the color statement, so from now on we’ll just refer to the color command. If you use a simple color command, as in the first example, Brunel chooses a suitable palette. In this case “language” is a categorical field, so it chooses a nominal palette. This is a palette of 19 colors chosen to be visually distinct. The second example specifies which colors we want in the output space. The first category in the “language” field is special, so we ask for a palette consisting of white, then all the usual colors from the nominal palette. Because we know the data well, we can hand-craft a color mapping here that reflects the language patterns better. I used color(language:[white, red, yellow, green, cyan, green, green, blue, blue, blue, blue, gray, gray, gray, gray, gray]) to use red for lists containing Arabic, green when they contain English, and blue when they contain French. I mixed the colors to show lists where the languages are mixed. The geographical similarities in languages can be seen pretty easily in the chart, but the colors are a bit bright. Which leads to the following adjustment … For areas and “large” shapes, Brunel automatically creates muted versions of colors, so names like “red” and “green” are less visually dominant and distracting. This can be altered by adding a “=” to the list of colors, which means “leave the colors unmuted”, or a series of asterisks, which means “mute them more”. Here are a couple of examples, using the same basic palette as the previous one If you have a smaller fixed number of categories in your field, you can use palettes carefully designed to work well for that number. 
Rather than provide them in Brunel, our suggestion is to go directly to a site that allows you to select them (Cynthia Brewer’s site ColorBrewer is the standout recommendation) and copy the array of color codes and paste them directly into the Brunel code. For the example on the right, we did exactly that, using en:[‘#beaed4′, ‘#7fc97f’]) as our colors (the quotes are optional in this list). COLOR RANGES For numeric data, we want to map the data values to a smoothly changing range of values. So, instead of defining individual values, we define values which are intermediate points on a smoothly changing scale of colors. We do this using the same syntax pattern as for categorical data. We are using the latitude of the capital city to color by, rather than a more informative variables, so the color changes can be seen more clearly. On the left we specified color as color(capital_lat) so we get Brunel’s default blue-red sequential scale. This uses a variety of hues, again taken from ColorBrewer, to provide points along a linear scale of color. On the right we use an explicit color mapping from ColorBrewer, color(capital_lat:[‘#8c510a’, ‘#bf812d’, ‘#dfc27d’, ‘#f6e8c3′, ‘#f5f5f5′, ‘#c7eae5′, ‘#80cdc1′, ‘#35978f’, ‘#01665e’]) , where we simply went to the site, found a scale we liked and used the export>Javascript method. Note that Brunel will adapt to to the number of colors in the palette automatically. The above two charts show the difference between asking for color(capital_lat:reds) and color(capital_lat:red) . When a plural is used, it gives a palette that uses multiple hues, with the general tone of the color being requested. With a singular color request, you only gets shades of that exact hue . Generally we would recommend the former unless you have some specific reason to need the single-hue version. We can specify multiple colors in the same way as we do for categorical data, using capital_lat:[purpleblues, reds]) on the left and capital_lat:[blue, red]) on the right. When we have exactly two colors defined, we stitch them together, running through a neutral central color, to make a diverging color scale that highlights the low and high values of the field. SUMMARY Mapping data to color is a tricky business, and in version 0.8 of Brunel our goal is twofold: * Ensure that if you only specify a field, a suitable mapping is generated * Allow the output space of colors to be customized for user needs In future versions of Brunel we will add mapping for the input space, so, for example, we could tie the value mapped to white in the last example to be the equator, not simply midway through the data range. Look for that in a few months! * Click to share on Twitter (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Google+ (Opens in new window) * Tagged: brunel / brunelvis / color / d3 / dashboard / datavis / geo / infovis / mapping / maps / open source / perception / vis / visualizationLEAVE A COMMENT Click here to cancel reply. Tell us who you are Name (required) Email (required) Comment text Notify me of follow-up comments by email. Notify me of new posts by email. RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM Privacy IBM",Brunel Visualization now has thoroughly re-worked code to provide improved options for mapping data to color. 
These maps of Africa show the results.,Twelve ways to color a map of Africa using Brunel,Live,283 839,"* Home * Community * Projects * Blog * About * Resources * Code * Contributions * University * IBM Design * Apache SystemML * Apache Spark™ SPARK.TC ☰ * Community * Projects * Blog * About * Resources * Code * Contributions * University * IBM Design * Apache SystemML * Apache Spark™ MACHINE LEARNING MACHINE LEARNING IN APACHE SPARK 2.0: UNDER THE HOOD AND OVER THE RAINBOW Now that the dust has settled on Apache Spark™ 2.0 , the community has a chance to catch its collective breath and reflect a little on what was achieved for the largest and most complex release in the project's history. One of the main goals of the machine learning team here at the Spark Technology Center is to continue to evolve Apache Spark as the foundation for end-to-end, continuous, intelligent enterprise applications. With that in mind, we'll briefly mention some of the major new features in the 2.0 release in Spark's machine-learning library, MLlib, as well as a few important changes beneath the surface. Finally, we'll cast our minds forward to what may lie ahead for version 2.1 and beyond. For MLlib, there were a few major highlights in Spark 2.0: * The older RDD-based API in the mllib package is now in maintenance mode, and the newer DataFrame-based API (in the ml package), with its support for DataFrames and machine learning pipelines, has become the focus of future development for machine learning in Spark * Full support for saving and loading pipelines in Spark's native format, across languages (with the exception of cross-validators in Python) * Additional algorithm support for Python and R While these have already been well covered elsewhere, the STC team has worked hard to help make these initiatives a reality — congratulations! Another key focus of the team has been feature parity — both between mllib and ml , and between the Python and Scala APIs. In the 2.0 release, we're proud to have contributed significantly to both areas, in particular reaching close to full parity for PySpark in ml . UNDER THE HOOD Despite the understandable attention paid to major features in such a large release, what happens under the hood in terms of bug fixes and performance improvements can be equally important (if not more so!). While the team has again been involved across the board in this area, here we'd like to highlight just one example of a small (but subtle) issue that has dramatic implications for performance. WE NEED TO WORK ON OUR COMMUNICATION... Linear models, such as logistic regression, are the work-horses of machine learning. They're especially useful for very large datasets, such as those found in online advertising and other web-scale predictive tasks, because they are relatively less complex than, say, deep learning, and so are easier to train and more scalable. As such, they are among the most-used algorithms around, and were among the earliest algorithms added to Spark ml . In distributed machine learning, the bottleneck for scaling large models (that is, where there are a large number of unique variables in the model) is often not computing power, as one might think, but communication across the network. This is because these algorithms are iterative in nature, and tend to send a lot of data back and forth between nodes in a cluster in each iteration. Therefore, it pays to be as communication-efficient as possible when constructing such an algorithm. 
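For readers who have not yet used the DataFrame-based ml API described above, here is a minimal, hypothetical sketch of training a logistic regression inside a spark.ml Pipeline and saving the fitted model in Spark's native format. The toy data, column names and save path are assumptions for illustration only, not anything taken from the benchmark discussed next.

# Minimal sketch of the DataFrame-based spark.ml API discussed above.
# Column names ("f1", "f2", "label") and the save path are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("lr-sketch").getOrCreate()

# A tiny toy DataFrame standing in for real training data.
df = spark.createDataFrame(
    [(0.0, 1.2, 0.5), (1.0, 3.4, 1.8), (0.0, 0.7, 0.1), (1.0, 2.9, 2.2)],
    ["label", "f1", "f2"])

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(maxIter=20, featuresCol="features", labelCol="label")

# Fitted pipelines can be saved and reloaded in Spark's native format,
# which is the cross-language persistence highlighted earlier.
model = Pipeline(stages=[assembler, lr]).fit(df)
model.write().overwrite().save("/tmp/lr-pipeline-model")

model.transform(df).select("label", "prediction").show()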
While working on adding multi-class logistic regression to Spark ML (part of the ongoing push towards parity between ml and mllib ), STC team member Seth Hendrickson realized that, due to the way that Spark automatically serializes data when inter-node communication is required (e.g. during a reduce or aggregation operation), the aggregation step of the logistic regression training algorithm resulted in 3x more data being communicated than necessary. This is illustrated in the chart below, where we compare the amount of shuffle data per iteration as the feature dimension increases. Once fixed , this resulted in a decrease in per-iteration time of over 11% (shown in the chart below), as well as a decrease in overall execution time of over 20%, mostly due to lower shuffle read time and less data being broadcast at each iteration. We would expect the performance difference to be even larger as data and cluster size increases 1 . Subsequently, various Spark community members rapidly addressed the same issue in linear regression and AFT survival regression (these patches will be released as part of version 2.1). So there you have it - Spark 2.0 even improves your communication skills! OVER THE RAINBOW What does it mean when we refer to Apache Spark as the ""foundation for end-to-end, continuous, intelligent enterprise applications""? In the context of Spark's machine learning pipelines, we believe this means usability, scalability, streaming support, and closing the loop between data, training and deployment to enable automated, intelligent workflows - in short the ""pot of gold"" at the end of the rainbow! In line with this vision, the focus areas for the team for Spark 2.1 and beyond include: * Achieving full feature parity between mllib and ml * Integrating Spark ML pipelines with the new structured streaming API to support continuous machine-learning applications * Exploring additional model export capabilities including standardized approaches such as PMML * Improving the usability and scalability of the pipeline APIs, for example in areas such as cross-validation and efficiency for datasets with many columns We'd love to hear your feedback on these areas of interest — email me at NickP@za.ibm.com, and we look forward to working with the Spark community to help drive these initiatives forward. -------------------------------------------------------------------------------- 1. Tests were run on a relatively small cluster with 4 worker nodes (each with 48 cores, 100GB memory). Input data ranged from 6GB to 200GB, with 48 partitions, and was sized to fit in cluster memory at the maximum feature size. The quoted performance improvement figures are for the maximum feature size. ↩ SHARE ON * * Share NICK PENTREATH DATE 30 August 2016TAGS machine learning, spark performanceSPARK TECHNOLOGY CENTER * Community * Projects * Blog * About The Apache Software Foundation has no affiliation with and does not endorse or review the materials provided on this website, which is managed by IBM. Apache®, Apache Spark™, and Spark™ are trademarks of the Apache Software Foundation in the United States and/or other countries.","Now that the dust has settled on Apache Spark 2.0, the community has a chance to catch its collective breath and reflect a little on what was achieved for the largest and most complex release in the project's history.",Apache Spark 2.0: Machine Learning. 
Under the Hood and Over the Rainbow.,Live,284 844,"Compose The Compose logo Articles Sign in Free 30-day trialMETRICS MAVEN: CROSSTAB REVISITED - PIVOTING WISELY IN POSTGRESQL Published Apr 4, 2017 metrics maven postgresql Metrics Maven: Crosstab Revisited - Pivoting Wisely in PostgreSQLIn our Metrics Maven series, Compose's data scientist shares database features, tips, tricks, and code you can use to get the metrics you need from your data. In this article, we'll take another look at crosstab to help you pivot wisely. In this article, we'll look again at the crosstab function, focusing this time on the option that does not use category sql. We'll explain how and when (not) to use it. We'll also compare it to the option that does use category sql, which we covered in our previous article on pivot tables using crosstab . You can also find some discussion of both options in the official Postgres documentation for tablefunc . To use crosstab with Compose PostgreSQL, refer to the previous article for how to enable tablefunc for your deployment. PIVOTING YOUR DATA Pivoting your data can sometimes simplify how data is presented, making it more understandable. PostgreSQL provides the crosstab function to help you do that. The simplest option for crosstab , which we'll focus on in this article, is referred to as crosstab(text sql) in the documentation. We're going to call it the ""basic option"" in this article. It differs from the crosstab(text source_sql, text category_sql) option in a couple of significant ways, which we'll cover a little later in this article. If you want to learn how the crosstab(text source_sql, text category_sql) option works before diving into the basic option we're going to look at here, check out our article Creating Pivot Tables in PostgreSQL Using Crosstab . OUR DATA As we did in the previous article on crosstab , we'll use the product catalog from our hypothetical pet supply company. id | product | category | product_line | price | number_in_stock --------------------------------------------------------------------------- 1 | leash | dog wear | Bowser | 15.99 | 48 2 | collar | dog wear | Bowser | 10.99 | 76 3 | name tag | dog wear | Bowser | 5.99 | 204 4 | jacket | dog wear | Bowser | 24.99 | 12 5 | ball | dog toys | Bowser | 6.99 | 27 6 | plushy | dog toys | Bowser | 8.99 | 30 7 | rubber bone | dog toys | Bowser | 4.99 | 52 8 | rubber bone | dog toys | Tippy | 4.99 | 38 9 | plushy | dog toys | Tippy | 6.99 | 16 10 | ball | dog toys | Tippy | 2.99 | 47 11 | leash | dog wear | Tippy | 12.99 | 34 12 | collar | dog wear | Tippy | 6.99 | 88 13 | name tag | dog wear | Tippy | 5.99 | 165 14 | jacket | dog wear | Tippy | 20.99 | 50 15 | rope chew | dog toys | Bowser | 7.99 | 27 We've got one additional item in the catalog than we had last time - a rope chew toy in the Bowser line. As tends to be the case in a relational database, the data in our table extends downward, repeating values for product_line, category, and product in different combinations for each price and inventory value. We want to create a pivot table to get a simpler view of our catalog. Let's get started. AGGREGATING A VALUE Let's start by getting the average price of each product category for each of the product lines. This was the same example we used in our previous article, but this time we'll use the basic crosstab option which does not use category sql. 
Here's what that looks like: -- using the basic option SELECT * FROM crosstab( 'select distinct product_line, category, round(avg(price),2) as avg_price from catalog group by product_line, category order by 1,2') AS catalog(product_line character varying, dog_toys numeric, dog_wear numeric ) ; Let's look at the sub-query first. The first thing to notice is that the sub-query is encapsulated in single quotes. The query is passed to the crosstab function as a string that it will run. Next, we're using round with the avg function to get the average price rounded to two decimal places for each product line and category combination. If you need a refresher on either of these functions, we covered rounding in our Make Data Pretty article and avg in our article on mean . To get the average aggregate value, we're using group by with the other two columns: product line and category. Finally, we're ordering our results first by product line then by category. The ordering is important because in the outer query, we have to explicitly name the columns we want to see and need to know what order the data will be populated into them. The outer query calls the crosstab function on the results from the sub-query and then specifies the column names and data types for presenting that pivoted data. In effect, this creates a new table that is presented as the result of the query. Here's what the result looks like: product_line | dog_toys_avg_price | dog_wear_avg_price ------------------------------------------------------- Bowser | 7.24 | 14.49 Tippy | 4.99 | 11.74 If you compare this result to the result we got in the previous article , which used the category sql option for crosstab , you'll find they are exactly the same. The only difference here is that the Bowser line of dog toys has increased slightly since then due to the addition of the new rope chew toy. If that's the case, then you may be wondering what the difference is between the two crosstab options... Let's look into that. COMPARING CROSSTAB OPTIONS Before we look at the key differences between the two optons, let's cover a couple caveats that apply to both options. SIMILAR CAVEATS As we mentioned above and in the previous article, both options require you to indicate an explicit order for the resulting columns. If you don't order the data, you will have a hodge-podge in your pivoted columns. PostgreSQL has no way of being ""smart"" here. It does not know how your pivoted columns map to the data you're querying on. You have to know that and, to do that, you need to order the data. The next probably goes without saying, but let's just go ahead and be extra clear here. The resultant pivoted rows must have only one value for each row. If there can be multiple values, then PostgreSQL will return you one from the list. For example, if we did not average the price in the query above (which aggregates the price to a single value), but instead simply requested the price column, we could get any one of the prices associated with each product category and product line. The point of pivoting the data is to present a single value for each possible combination of attributes. The pivoted columns' data types must match the data types expected from the source data. For example, we would get an error if we had our pivoted column ""avg_price"" specified as an int instead of numeric . The result of the avg function on our price values will not produce an int . If we wanted the pivoted column to be an int , we'd need to cast the value accordingly in the sub-query. 
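If you would rather run that pivot from application code than from psql, a minimal sketch with Python and psycopg2 follows. The connection details are placeholders, and it assumes the tablefunc extension is already enabled on the deployment as described earlier.

# Hypothetical sketch: running the basic crosstab query above from Python.
# Connection parameters are placeholders; tablefunc is assumed to be enabled.
import psycopg2

conn = psycopg2.connect(host="host", port=5432, user="user",
                        password="password", dbname="petsupplies")
cur = conn.cursor()

cur.execute("""
    SELECT * FROM crosstab(
      'select distinct product_line, category, round(avg(price),2) as avg_price
       from catalog
       group by product_line, category
       order by 1,2'
    ) AS catalog(product_line character varying,
                 dog_toys_avg_price numeric,
                 dog_wear_avg_price numeric);
""")

for product_line, dog_toys_avg, dog_wear_avg in cur.fetchall():
    print(product_line, dog_toys_avg, dog_wear_avg)

cur.close()
conn.close()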
Now the differences... BIG DIFFERENCES The reason our previous article used the category sql option of crosstab is that it is more flexible than the basic option we covered here. We recommend using the category sql option over the basic option. Here's why: The category sql option allows you to include ""extra columns"" in your pivot table result. The extra columns are not used for pivoting. The common use for these columns is to provide additional descriptors of the data in each row. You can have as many extra columns as you want; however, there can only be one extra column value for each. As mentioned above in the caveat section, multiple possible values will result in any one of the values being displayed. Here's an example to make this easier to understand: -- using category sql option SELECT * FROM crosstab ( 'select distinct product_line, case when product_line = 'Bowser' then 'Fashion and fun for big dogs.' when product_line = 'Tippy' then 'Small dog fashion and fun.' else null end as description, category, round(avg(price),2) as avg_price from catalog group by product_line, category order by product_line', 'select distinct category from catalog order by 1' ) AS ( product_line character varying, description text, dog_toys_avg_price numeric, dog_wear_avg_price numeric ) ; In this case, we've added an ""extra column"" called ""description"". For this example, we've provided the values manually in a case statement, but another column from the table could also be used if there was a column that contained the additional descriptive data. Note the escaped single quotes (leaving us with two single quotes around each text value) since the sub-query for crosstab needs to be encapsulated in single quotes. We'll get a result like this: product_line | description | dog_toys_avg_price | dog_wear_avg_price ------------------------------------------------------------------------------------------ Bowser | Fashion and fun for big dogs. | 7.24 | 14.49 Tippy | Small dog fashion and fun. | 4.99 | 11.74 If you try to add an extra column using the basic crosstab option, you'll get this error: ""The provided SQL must return 3 columns: rowid, category, and values."" No extra columns allowed. The next difference is the more compelling one to use the category sql crosstab option: it places data in the correct columns when one of the rows is missing a particular value for the specified attribute. Remember our new dog toy, the rope chew? The Tippy line does not have that toy. If we wanted to pivot by toy products instead of by product categories, we would only be able to get an accurate result using the category sql option of crosstab . 
Check it out: -- using category sql option SELECT * FROM crosstab( 'select distinct product_line, category, product, price from catalog where category = ''dog toys'' order by 1,2', 'select distinct product from catalog where category = ''dog toys'' order by 1' ) AS ( product_line character varying, category character varying, ball_price numeric, plushy_price numeric, rope_chew_price numeric, rubber_bone_price numeric ) ; We'll get the result we expect (a null value for the rope chew toy on the Tippy product line row): product_line | category | ball_price | plushy_price | rope_chew_price | rubber_bone_price ------------------------------------------------------------------------------- Bowser | dog toys | 6.99 | 8.99 | 7.99 | 4.99 Tippy | dog toys | 2.99 | 6.99 | | 4.99 Notice in the query above that we did not need to use an aggregation for the price because there is one price per product per product line. We also added the product category as an ""extra column"" since our pivoted rows were limited to only the category for dog toys - just an additional example of using extra columns for you to ""chew"" on (pun intended). If we use the basic option of crosstab to present dog toy prices per product line, not only can we not use any extra columns as we learned above, but worse, we'll get a bad result... Here's the SQL: -- using the basic option SELECT * FROM crosstab( 'select distinct product_line, product, price from catalog where category = ''dog toys'' order by 1,2') AS catalog(product_line text, ball_price numeric, plushy_price numeric, rope_chew_price numeric, rubber_bone_price numeric ) ; And here's the result: product_line | ball_price | plushy_price | rope_chew_price | rubber_bone_price ------------------------------------------------------------------------------- Bowser | 6.99 | 8.99 | 7.99 | 4.99 Tippy | 2.99 | 6.99 | 4.99 | WHAT?! The rubber bone price for the Tippy line shifted over to populate the rope chew column! That's because, without the category sql, the basic option does not know how many columns to expect and simply populates the data top-to-bottom, left-to-right until there are no more values. So, you can only use the basic option if your data values have exactly the same number and type. That's a pretty big limiter in our book. WRAPPING UP Hopefully you now have a much more thorough understanding of crosstab in PostgreSQL, including the differences between the two options that are presented in the documentation. You are now armed with the knowledge that will help you pivot wisely. Image by: herbert2512 Lisa Smith - keepin' it simple. Love this article? 
Head over to Lisa Smith’s author page and keep reading. © 2017 Compose, an IBM Company","we'll look again at the crosstab function, focusing this time on the option that does not use category sql. We'll explain how and when (not) to use it. We'll also compare it to the option that does use category sql ...",Metrics Maven: Crosstab Revisited - Pivoting Wisely in PostgreSQL,Live,285 847,"USING CLOUDANT TO ENHANCE UPLOADS FOR IBM GRAPH Prachi Shirish Khadke / 12/8/16 Backend Developer for IBM Graph and a Ballroom Junkie! Hi. I am Prachi, a backend developer for IBM Graph, a fully-managed, enterprise-grade graph database service built on the cloud. Our development team works via a continuous delivery pipeline to regularly add new features, enhance existing ones and deliver bug fixes. Several weeks ago, I was working on the backend code to improve the graph upload experience, adding REST API methods for asynchronous graph uploads. When the service receives an asynchronous graph upload request, it notifies the user that the request has been accepted and generates an upload Id. The upload Id can be used to query the status of the upload via the service’s REST API, as in the following commands. This setup provides a nice user experience, since users are not blocked by a wait time that depends on how big the upload is or by slowness in the service.
# Session auth curl -X GET -H 'Content-Type:application/json' -u 'cffb672f-fe5e-4810-a5da-a6ce182014e2:2eafd208-841d-4afd-aa35-6bdb2214d84b' https://ibmgraph-alpha.ng.bluemix.net/32b7fa84-df0e-4546-b38e-74a71a1e69c7/_session {""gds-token"":""Y2ZmYjY3MmYtZmU1ZS00ODEwLWE1ZGEtYTZjZTE4MjAxNGUyOjE0Nzk1MDgxMTUxNTU6eXZNdUxBSGxNSXYvUEszM3pMVDJEakh6QlVkRFdEdStucFFiRFd2d2xmcz0=""} # Asynchronous graph upload curl -X POST -H 'Content-Type:multipart/form-data' -H 'Authorization: gds-token Y2ZmYjY3MmYtZmU1ZS00ODEwLWE1ZGEtYTZjZTE4MjAxNGUyOjE0Nzk1MDgxMTUxNTU6eXZNdUxBSGxNSXYvUEszM3pMVDJEakh6QlVkRFdEdStucFFiRFd2d2xmcz0=' -F 'graphml=@./air-routes-small.graphml' https://ibmgraph-alpha.ng.bluemix.net/32b7fa84-df0e-4546-b38e-74a71a1e69c7/g/uploads/graphml {""uploadId"":""502f4f57-f60c-4e92-ae9a-63eca980817a"",""operation"":""bulkload"",""status"":""ACCEPTED"",""code"":202} # Graph upload status using uploadId curl -X GET -H 'Content-Type:application/json' -H 'Authorization: gds-token Y2ZmYjY3MmYtZmU1ZS00ODEwLWE1ZGEtYTZjZTE4MjAxNGUyOjE0Nzk1MDgxMTUxNTU6eXZNdUxBSGxNSXYvUEszM3pMVDJEakh6QlVkRFdEdStucFFiRFd2d2xmcz0=' https://ibmgraph-alpha.ng.bluemix.net/32b7fa84-df0e-4546-b38e-74a71a1e69c7/g/uploads/502f4f57-f60c-4e92-ae9a-63eca980817a/status {""uploads"":[{""serviceId"":""32b7fa84-df0e-4546-b38e-74a71a1e69c7"",""graphId"":""g"",""uploadId"":""502f4f57-f60c-4e92-ae9a-63eca980817a"",""startTimestamp"":1479508238184,""completionTimestamp"":null,""statusCode"":202,""statusMessage"":""ACCEPTED"",""type"":""bulkload""},{""serviceId"":""32b7fa84-df0e-4546-b38e-74a71a1e69c7"",""graphId"":""g"",""uploadId"":""502f4f57-f60c-4e92-ae9a-63eca980817a"",""startTimestamp"":1479508238184,""completionTimestamp"":1479508249550,""statusCode"":201,""statusMessage"":""COMPLETED"",""type"":""bulkload""}]} Part of this effort required storing state in a Cloudant database. Initially, I added three indexes to query upload status in different ways – using the Service Id, the Graph Id and the Upload Id. 
These queries looked like this: # Graph upload status using uploadId curl -X GET -H 'Content-Type:application/json' -H 'Authorization: gds-token Y2ZmYjY3MmYtZmU1ZS00ODEwLWE1ZGEtYTZjZTE4MjAxNGUyOjE0Nzk1MDgxMTUxNTU6eXZNdUxBSGxNSXYvUEszM3pMVDJEakh6QlVkRFdEdStucFFiRFd2d2xmcz0=' https://ibmgraph-alpha.ng.bluemix.net/32b7fa84-df0e-4546-b38e-74a71a1e69c7/g/uploads/502f4f57-f60c-4e92-ae9a-63eca980817a/status�/pre� {""uploads"":[{""serviceId"":""32b7fa84-df0e-4546-b38e-74a71a1e69c7"",""graphId"":""g"",""uploadId"":""502f4f57-f60c-4e92-ae9a-63eca980817a"",""startTimestamp"":1479508238184,""completionTimestamp"":null,""statusCode"":202,""statusMessage"":""ACCEPTED"",""type"":""bulkload""},{""serviceId"":""32b7fa84-df0e-4546-b38e-74a71a1e69c7"",""graphId"":""g"",""uploadId"":""502f4f57-f60c-4e92-ae9a-63eca980817a"",""startTimestamp"":1479508238184,""completionTimestamp"":1479508249550,""statusCode"":201,""statusMessage"":""COMPLETED"",""type"":""bulkload""}]} # Graph upload status using graphId curl -X GET -H 'Content-Type:application/json' -H 'Authorization: gds-token Y2ZmYjY3MmYtZmU1ZS00ODEwLWE1ZGEtYTZjZTE4MjAxNGUyOjE0Nzk1MDgxMTUxNTU6eXZNdUxBSGxNSXYvUEszM3pMVDJEakh6QlVkRFdEdStucFFiRFd2d2xmcz0=' https://ibmgraph-alpha.ng.bluemix.net/32b7fa84-df0e-4546-b38e-74a71a1e69c7/g/uploads/status {""uploads"":[{""serviceId"":""32b7fa84-df0e-4546-b38e-74a71a1e69c7"",""graphId"":""g"",""uploadId"":""502f4f57-f60c-4e92-ae9a-63eca980817a"",""startTimestamp"":1479508238184,""completionTimestamp"":null,""statusCode"":202,""statusMessage"":""ACCEPTED"",""type"":""bulkload""},{""serviceId"":""32b7fa84-df0e-4546-b38e-74a71a1e69c7"",""graphId"":""g"",""uploadId"":""502f4f57-f60c-4e92-ae9a-63eca980817a"",""startTimestamp"":1479508238184,""completionTimestamp"":1479508249550,""statusCode"":201,""statusMessage"":""COMPLETED"",""type"":""bulkload""}]} # Graph upload status using serviceId curl -X GET -H 'Content-Type:application/json' -H 'Authorization: gds-token Y2ZmYjY3MmYtZmU1ZS00ODEwLWE1ZGEtYTZjZTE4MjAxNGUyOjE0Nzk1MDgxMTUxNTU6eXZNdUxBSGxNSXYvUEszM3pMVDJEakh6QlVkRFdEdStucFFiRFd2d2xmcz0=' https://ibmgraph-alpha.ng.bluemix.net/32b7fa84-df0e-4546-b38e-74a71a1e69c7/uploads/status {""uploads"":[{""serviceId"":""32b7fa84-df0e-4546-b38e-74a71a1e69c7"",""graphId"":""g"",""uploadId"":""502f4f57-f60c-4e92-ae9a-63eca980817a"",""startTimestamp"":1479508238184,""completionTimestamp"":null,""statusCode"":202,""statusMessage"":""ACCEPTED"",""type"":""bulkload""},{""serviceId"":""32b7fa84-df0e-4546-b38e-74a71a1e69c7"",""graphId"":""g"",""uploadId"":""502f4f57-f60c-4e92-ae9a-63eca980817a"",""startTimestamp"":1479508238184,""completionTimestamp"":1479508249550,""statusCode"":201,""statusMessage"":""COMPLETED"",""type"":""bulkload""}]} At first, I used Cloudant map-reduce views for index creation, but code review feedback recommended Cloudant Queries instead. This meant rewriting a lot of code, which was painful to contemplate when the existing logic already worked. On the plus side, we’d gain a performance improvement. So I rewrote index creation using Cloudant Queries. But it was still slow. The problem was that I had created only one Cloudant design document, sequentially creating the indexes, to keep things organized properly. A colleague suggested that separate design documents may help. At first, that approach seemed unorganized and sloppy, until I realized: backend development is like general surgery. As Dr. Richard Webber said in Grey’s Anatomy: I don’t need pretty. And I don’t need perfect. What I need is for this to work. 
And what’s gonna make it works is for me to take out that tumor and put these healthy organs inside my very sick patient. It won’t be pretty, but it will work, and it will keep my patient alive. In engineering school, they teach us the importance of performance and agility. This real-world example shows how prioritizing engineering concerns over organization and prettiness is smart and effective. I ended up invoking 3 index creation requests in parallel, which was so fast! It’s learning moments like this that just make me smile. The fact that Cloudant Query is a REST API – stateless, predictable and easy to use, just added to my joy. :)",How I solved a graph development issue with parallel Cloudant Index creation requests.,Using Cloudant to enhance uploads for IBM Graph,Live,286 848,"Enterprise Pricing Articles Sign in Free 30-Day TrialCOMPOSE FOR MYSQL - A DEVELOPER'S VIEW Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Published Jan 26, 2017In this interview with Chris Winslett, Compose developer and lead on the Compose for MySQL, we talk about why MySQL is on the Compose platform, what makes it different on Compose and how the Compose for MySQL beta is going. Q: So why Compose for MySQL? Chris Winslett: We already had nine databases and we already had a SQL database. The question is though is ""Why MySQL?"" and the answer is that it's simple yet powerful. You can get the relational database model without the high administration cost you see with other SQL databases. You get all the SQL capabilities, SELECT statements with GROUPing and JOINing and so on. It also can query across databases and concatenate the results together with UNION ALL. With other databases, like Postgres, you have to make choices which tend to increase complexity; MySQL has this feature out of the box. There's also a large ecosystem of libraries - every programming language can connect to MySQL. PHP, Python, C, all of then have extensive libraries and a lot of these have been around so long that there are two or more versions. PHP had an old model and a new model. Ruby has a MySQL library, and also a later MySQL2 library. They've all gone through these iterations reacting to the needs of the database systems leading to some very mature experiences when running on top of MySQL. That means that it's easy for a new user to spin up a database and get working with it. A large ecosystem of tools is another reason. Some common tools include Wordpress, Drupal, SugarCRM and other open-source CRMs, along with GUIs for creating queries and reports. The size of the MySQL environment is compelling. It's been the largest database since the late 90s when web databases and the open-source movement began growing. Which leads to the last reason - Customers were asking for it and wanted it. They had other databases on Compose and enjoyed the autoscaling and the automatic backups and they wanted a MySQL which was easy to deploy and highly available like all our databases are. Q: So what do users get when they deploy Compose for MySQL ? CW: We start with availability on AWS, Softlayer and Google Cloud; you can deploy on all those platforms. Then there's the Compose process for what we expect from databases. High availability, automated disaster recovery backups, failover support and simple routing all delivered from a private VLAN and manageable from the web. Q: How do you create a MySQL database on Compose? 
Same as any other Compose database: sign up - we have free thirty day trials, and get to the Compose web front end then click on the Create Deployments button. You'll see all the databases we do at Compose there. Browse down to the Beta section, and you'll find Compose for MySQL in there. Click it, enter a name for your database, pick where you want it, pick a size - remembering we have auto-scaling - and click Create . Your Compose for MySQL database will be with you shortly; it takes about two or three minutes. Q: What does the Beta mean? CW: A Beta database is a database that Compose has just begun offering. We've been offering MySQL since late October and during this beta period we monitor the database. With MySQL, what we are doing is watching the metrics, monitoring the uptime, seeing how we can improve the uptime, how we can improve self-healing tasks and seeing what kind of questions customers have about MySQL. We fully expect this to be a production-grade database and we have high expectations during this beta. However, we want customers to know it's a new database on Compose so it may not best fit some use cases. That's where we gather data in the beta. Q: So what MySQL are you running? CW: We're running MySQL 5.7.17 currently with Group Replication. We don't modify MySQL in anyway, so you can use all your standard MySQL drivers and tools with it. The one caveat is that because we use Group Replication to run the MySQL cluster, all the tables in the database require primary keys. A primary key is a unique identifier for a row; it can be an integer, UUID or string. It just has to be unique for the clustering. Q: Where would you not have unique ids? CW: One example would be a join table, where you are creating a table which joins users records and group records together. The table created to represent that join would typically not be designed to have a unique id. So what you need to do is alter the table, add an id column and make that id column an auto-incrementing integer. Q: Why do you need to do this? CW: The unique id caveat lets us run multiple nodes with replication and high availability. Having a unique id means it's easier for replication to see what's new and what has changed and keep things consistent. That means we can replicate data over three nodes. Q: Why three nodes? CW: Three replicated nodes allow us to take a node offline without bringing the database down. That means that we can do zero-downtime maintenance. If you've run databases before, you'll know the number one reason for a database outage is not because a host has gone down, but because you need to do maintenance on that host; update the kernel, update how the system is tuned or reset some parameters. Maintenance is the number one reason for database downtime. We also get zero-downtime backups. MySQL backups are best if you can shut down the database on a node, so what we do is shut down a data node, do the backups and bring that node back up. That gives us the best, most consistent backups. Finally, we get failover during a server outage. While the number one reason for an outage is maintenance, the number one reason for an unplanned outage is server failure. Three nodes give us a lot of advantages during these unplanned outages. That's why we were ok with the requirement to have primary keys on tables. The tradeoff for high availability is something we think – and we expect customers will think – is worth it. Q: So, how do you pick which node to connect to? CW: We look to make it as simple as possible. 
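As an aside on the join-table point above: a minimal sketch of adding the required auto-incrementing primary key from Python with PyMySQL might look like the following, where the table name, new column and connection details are illustrative assumptions. The same ALTER TABLE statement can of course be run from any MySQL client.

# Hypothetical sketch: adding an auto-incrementing primary key to a join table
# so it satisfies the Group Replication requirement described above.
# Table name, column name and connection details are illustrative assumptions.
import pymysql

conn = pymysql.connect(host="host", port=3306, user="user",
                       password="password", db="appdb")
try:
    with conn.cursor() as cur:
        cur.execute("""
            ALTER TABLE users_groups
            ADD COLUMN id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
        """)
    conn.commit()
finally:
    conn.close()

Back to how connections are routed once the cluster is in place: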
Customers applications connect to a haproxy and that haproxy talks to the master data node. We try and take a lot of the magic out of the process of connecting. The haproxy knows which node is currently the master data node. Q: How do you know what's in your cluster? CW: Look at the Topology in the Compose console overview. What you can see there is the result of health checks being run on the cluster. You can see the clusters own private infrastructure with the three data nodes on them and you can see the proxy which is routing to the master among the data nodes. You don't need to know that, though, all you need is the to know is the address of the proxy. Q: Do you have any advice for someone bringing an application to Compose for MySQL and the cloud? CW: Remember to create your cloud database as close as possible, network-wise, to your application as possible. Q: You mention how Compose runs beta databases; Any insights from the MySQL Beta so far? CW: We'll be blogging about the Compose for MySQL beta and doing some deep dives into how group replication works and how we recover from failure. Look out for them appearing soon. -------------------------------------------------------------------------------- If you have any feedback about this or any other Compose article, drop the Compose Articles team a line at articles@compose.com . We're happy to hear from you. Image by Maxime Daquet Share on Twitter Share on Facebook Share on Google+ Vote on Hacker News Subscribe Dj Walker-Morgan is Compose's resident Content Curator, and has been both a developer and writer since Apples came in II flavors and Commodores had Pets. Love this article? Head over to Dj Walker-Morgan’s author page and keep reading. Company About Us We’re Hiring Articles Write Stuff Plans & Pricing Customer Stories Compose Webinars Support System Status Support Documentation Security Privacy Policy Terms of Service Products MongoDB Elasticsearch RethinkDB Redis PostgreSQL etcd RabbitMQ ScyllaDB MySQL Enterprise Add-ons * Deployments AWS SoftLayer Google Cloud * Subscribe Join 75,000 other devs on our weekly database newsletter. © 2017 Compose","In this interview with Chris Winslett, Compose developer, we talk about why MySQL is on the Compose platform, what makes it different on Compose and how the Compose for MySQL beta is going.",Compose for MySQL - A developer's view,Live,287 850,"☰ * Login * Sign Up * Learning Paths * Courses * Our Courses * Partner Courses * Badges * Our Badges * BDU Badge Program * BLOG Welcome to the BDUBlog .SUBCRIBE VIA FEED RSS - Posts RSS - Comments SUBSCRIBE VIA EMAIL Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address RECENT POSTS * This Week in Data Science (January 31, 2017) * This Week in Data Science (January 24, 2017) * This Week in Data Science (January 17, 2017) * This Week in Data Science (January 10, 2017) * This Week in Data Science (December 27, 2016) CONNECT ON FACEBOOK Connect on FacebookFOLLOW US ON TWITTER My TweetsTHIS WEEK IN DATA SCIENCE (JANUARY 31, 2017) Posted on January 31, 2017 by Janice Darling Here’s this week’s news in Data Science and Big Data. Don’t forget to subscribe if you find this useful! INTERESTING DATA SCIENCE ARTICLES AND NEWS * How will we cope with the AI Chatbot takeover? – How the capabilities of AI will impact the development of chatbots. * 6 areas of AI and Machine Learning to watch closely – A breakdown of six major areas defined by the term Artificial Intelligence. 
* IBM adds TensorFlow support to its PowerAI – IBM adds support for Google’s TensorFlow in a move highlighting the collaboration between the AI tech giants. * Social media data and the customer-centric strategy – How to utilize social media data in improving customer relations. * The Top Predictive Analytics Pitfalls to Avoid – Missteps to avoid when performing predictive analysis in order to obtain expected results from your models. * Trusting AI with important decisions: capabilities and challenges – The importance of considering the concrete benefits of AI while ensuring safety to property and human life. * What developers actually need to know about Machine Learning –A deviation from the traditional way of exposure to and learning Machine Learning. * Applied Data Science – Excerpts from a whitepaper on data science teams and the application of insights gained through analytics to the real world. * Apple joins Amazon, Facebook, Google, IBM and Microsoft in AI initiative –Apple joins the Partnership on AI to Benefit People and Society. * How Employers Judge Data Science Projects – 6 criteria that influence how potential employers evaluate applicants strength. * Introduction to Natural Language Processing, Part 1: Lexical Units – An exploration to the core concepts of Natural Language Processing. * What is Data Engineering? – The distinction between the wide fields of data science and data engineering. * Becoming a Data Scientist – An overview of the many skills and tools used by data scientists. * The Data Science Puzzle, Revisited – A discussion of how the key concepts related to data science and data science itself are unified. * Why It Matters That Artificial Intelligence Is About to Beat the World’s Best Poker Players – How a new AI system is contributing to advancement in the field. * Get Up to Speed with Data Science in 7 Easy Steps – 7 steps for beginners to get up-to-date with data science. UPCOMING DATA SCIENCE EVENTS * IBM Event: Big Data and Analytics Summit – February 14, 2017 @ 7:15 am – 4:45 pm COOL DATA SCIENCE VIDEOS * Deep Learning with Tensorflow – Recursive Neural Tensor Networks – An overview of Recursive Neural Tensor Networks and the Natural Language Processing problems that they are able to solve. * Deep Learning with Tensorflow – The Long Short Term Memory Model – An overview of the Long Short Term Memory Model. * Deep Learning with Tensorflow – The Recurrent Neural Network Model – An overview of the Recurrent Neural Network Model. SHARE THIS: * Facebook * Twitter * LinkedIn * Google * Pocket * Reddit * Email * Print * RELATED Tags: analytics , Big Data , data science , events -------------------------------------------------------------------------------- COMMENTS LEAVE A REPLY CANCEL REPLY * About * Contact * Blog * Events * Ambassador Program * Resources * FAQ * Legal Follow us * * * * * * * Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.",Here’s this week’s news in Data Science and Big Data.,"This Week in Data Science (January 31, 2017)",Live,288 853,"Slack’s Integration API and Cloudant’s HTTP API make it simple to store data directly into a Cloudant database without breaking a sweat. 
This tutorial shows how to create a custom slash command in Slack and how to post it directly to Cloudant.Slack is a messaging and team-working application that is used widely to allow disparate teams of people to chat, share files, and interact on desktop, tablet, and mobile platforms. We use Slack in IBM Cloud Data Services to coordinate our activities, to work in an open collaborative environment, and to cut down on email and meetings.One of the strengths of Slack is that it integrates with other web services, so events happening in Github or Stack Overflow can be surfaced in the appropriate Slack channels. Slack also has an API that lets you create custom integrations. The simplest of these is slash commands: when a user starts a Slack message with a forward slash followed by a command string, Slack can be configured to POST that data to an external API. Say you create the slash command /lunch. A user could type:","Slack's integration API allows external services to be plugged in with ease. Even if your service isn't listed in the off-the-shelf integrations, you can still push data to other HTTP services. This tutorial shows how a Slack 'slash command' can be configured to push data to a Cloudant or CouchDB database in a few easy steps.",Writing Data Directly to Cloudant from Slack,Live,289 858,"Outside of the core Elasticsearch toolset, there's a world of tools that make the search and analytics database even more useful and accessible. In this article we'll look at some and show what you do to get them working with Compose's Elasticsearch deployments. We'll start with a command line tool, move on to a simple search tool and finish with an all purpose client for searching and manipulating your Elasticsearch database...Let us start the tool tour with Es2unix, from the Elasticsearch developers. Es2unix is a version of the Elasticsearch API that you can use from the command line. It doesn't just make the API calls though, it also converts the returned results into a line-oriented, tabular format like many other Unix tools output. That makes it ideal for integrating Elasticsearch into your awk, grep and sort using shell scripts.Es2unix will need Java installed, Java 7 at least, and the binary version can be simply downloaded with a curl command and enabled with chmod as per the installation instructions:curl -s download.elasticsearch.org/es2unix/es >~/bin/eschmod +x ~/bin/esNote this assumes you have a bin directory in your $HOME and it's on your path.Now, when you run es it'll assume that Elasticsearch is running locally. When you are using Compose Elasticsearch, that isn't the case. If you've got the HTTP/TCP access portal enabled, you'll have to give the es command a URL to locate your Elasticsearch deployment. You can get the URL from your Compose dashboard - remember to substitute in the username and password of a Elasticsearch user (from the Users tab) into the URL. This URL is then passed using the -u option:$ es -u https://user:pass@haproxy1.dblayer.com:10360/ versiones 20140723711d4f9elasticsearch 1.3.4The es command is followed by one of a selection of subcommands. There we've used the version subcommand to get the version of the es command and the version of Elasticsearch it is talking to. 
The health of the cluster can be established with the health subcommand:$ es -u https://user:pass@haproxy1.dblayer.com:10360/ health -vtime cluster status nodes data pri shards relo init unassign11:14:39 EsExemplum green 3 3 3 6 0 0 0Drop the -v to get unlabelled results, ideal for passing into monitoring software - adding -v on many es subcommands is a signal that more extensive labelling of returned data is desired.The es command has the ability to count all documents or the number of documents that meets a simple query, and to search all indices and return matching ids:$ es -u https://user:pass@haproxy1.dblayer.com:10360/ count ""one species or variety""11:44:02 16 ""one species or variety""shows a count of documents matching the parts of that phrase to different extents. Using the search command we can dig deeper:$ es -u https://user:pass@haproxy1.dblayer.com:10360/ -v search ""one species or variety""score index type id0.16337 darwin-origin chapter II0.12559 darwin-origin chapter IX0.10360 darwin-origin chapter IV0.10141 darwin-origin chapter I0.09734 darwin-origin chapter XI0.09326 darwin-origin chapter V0.09226 darwin-origin chapter XV0.08744 darwin-origin chapter XIV0.08069 darwin-origin chapter VIII0.07525 darwin-origin chapter IIITotal: 16Now we can see the matching score along with the id, index and type of the document. Although here, 16 documents match, Elasticsearch returns only the top ten results by default. If we wanted to be more precise we could quote the string (remembering we're in the shell so back-slash escapes are needed) and select a field for matching:$ es -u https://user:pass@haproxy1.dblayer.com:10360/ -v search """"one species or variety"""" textscore index type id text0.03073 darwin-origin chapter I [""CHAPTER I. VARIATI0.03073 darwin-origin chapter IX [""CHAPTER IX. HYBRIDTotal: 2Other subcommands in es2unix include indices, for listing indexes, ids for retrieving all ids from an index and a variety of management reporting commands such as nodes, heap and shards.You'll have probably noticed that the es command is a little laborious when you have to specify the URL every time. Es2unix doesn't have any short cuts when it comes to passing that URL like environment variables. There is another way though to shorten things and thats by using an SSH access portal instead. If you configure an SSH access portal for your Elasticsearch deployment then the default command for creating your SSH tunnels makes a node of the cluster appear to be at localhost:9200 which is the default. Once you have an SSH tunnel set up, you can drop the entire -u [URL] part and use tools as if you had Elasticsearch locally configured.Sometime you just want to set up a quick search for your Elasticsearch database with the minimum of effort. The Calaca project is very useful in that regard. It's an all JavaScript search front end for Elasticsearch which connects up to Elasticsearch. To get up and running, you'll want to download and unpack the zip file available from the Github page. Calaca's configuration can be found in the file js/config.js which looks like this:var indexName = ""name""; //Ex: twittervar docType = ""type""; //Ex: tweetvar maxResultsSize = 10;var host = ""localhost""; //Ex: http://ec2-123-aws.comvar port = 9200;As you can see, it comes configured to use the database on localhost port 9200, so you could use the SSH shortcut above. 
But we're here anyway so we need to change the host variable to ""https://user:pass@haproxy1.dblayer.com"" to match the URL we're given in the Compose dashboard and don't forget to copy in the username and password. The port number also needs to be copied from the dashboard URL to the port variable. The rest of the configuration is selecting what to search and what to show. Set the indexName and docType variables to index and data type you want to search. So, for our example here we have a config.js that reads:var indexName = ""darwin-origin"";var docType = ""chapter"";var maxResultsSize = 10;var host = ""https://user:pass@haproxy1.dblayer.com"";var port = 10361;Then it's a matter of editing the index.html file to set what results are shown. In the middle of the file is a section which says:Edit the result.name and result.description to display what fields you want to display from your document:We have a particularly long block of text in our document which we truncates down and we use the id and title together to create a heading. Save that, open index.html in your browser – there's no need to deploy to a server – and you'll see Calaca's search field. Enter a term and you'll see results like so:It's a quick way to get a pretty search query front end up locally without wrestling with forming Curl/JSON requests or deploying a full on server.Where Calaca's great for a super simple search client, you might want something a little more potent for your searching. For that, try ESClient, which not only has an extensive search UI but adds the ability to display those results in a table or as raw JSON results and then edit and delete selected documents. Like Calaca, ESClient needs no server, just download the zip or clone the Github respository. Configuring it means just editing the config.js file and putting in the URL from the Compose dashboard:var Config = {'CLUSTER_URL':'https://user:pass@haproxy1.dblayer.com:10361',Then you open esQueryClient.html in your browser and before you know it, there's the ESClient configuration screen - click the Connect button and a connection to the Elasticsearch database will be made and you'll be moved to the Search tab where you can select index, type, fields, sort fields, specify a Lucene or DSL query and click Search to see the results in a table below the query.Double clicking on a result will let you edit the documents that make up the result or you can use the results as a guide for a delete operation. If you set to ""Raw JSON"" switch in the Configuration tab, you'll also be able to view the complete raw returned results in the JSON Results tab.It's all rather usefully functional and there's only one slight problem. If you look at the top of the ESClient page, you'll see it's displaying the username and password as part of the URL for the database you are connecting to. Not really ideal that, but the SSH access portal can help out there too. If you set up and activate the tunnel, then you can return the CLUSTER_URL value in the config.js file to http://localhost:9200 and there'll be no username or password to display on screen.We've touched on three tools in this article, but more importantly we've shown the practical differences between using the HTTP/TCP and SSH access portals on componse. With HTTP/TCP access, there will be usernames and passwords embedded in the URL you use and this will leave any scripts or tools you configure susceptible to shoulder surfers and the like. 
That said, for occasionally launched tools it is quick and simple.With the SSH access portal, the configuration and authentication is done when you set up the tunnel in a separate process and the tunnel means you can use Elasticsearch as if the node was installed locally. The downside is you do need to make sure the SSH tunnel is up before you run any command and it may be easier to go through the HTTP/TCP access portal. But then thats why we give you both options at Compose so you can choose what suits you and your applications best.",There's a world of tools that make the Elasticsearch even more useful and accessible. In this article we'll look at some and show what you do to get them working with Compose's Elasticsearch deployments. ,Elasticsearch Tools & Compose,Live,290 859,"Homepage Follow Sign in Get started * Home * ✍️ Contribute * * 🔥 ML Newsletter * Dang Ha The Hien Blocked Unblock Follow Following PhD student at UiO, Data Scientist at eSmart Systems Apr 5, 2017 -------------------------------------------------------------------------------- A GUIDE TO RECEPTIVE FIELD ARITHMETIC FOR CONVOLUTIONAL NEURAL NETWORKS The receptive field is perhaps one of the most important concepts in Convolutional Neural Networks (CNNs) that deserves more attention from the literature. All of the state-of-the-art object recognition methods design their model architectures around this idea. However, to my best knowledge, currently there is no complete guide on how to calculate and visualize the receptive field information of a CNN. This post fills in the gap by introducing a new way to visualize feature maps in a CNN that exposes the receptive field information, accompanied by a complete receptive field calculation that can be used for any CNN architecture. I’ve also implemented a simple program to demonstrate the calculation so that anyone can start computing the receptive field and gain better knowledge about the CNN architecture that they are working with. To follow this post, I assume that you are familiar with the CNN concept, especially the convolutional and pooling operations. You can refresh your CNN knowledge by going through the paper “ A guide to convolution arithmetic for deep learning [1]”. It will not take you more than half an hour if you have some prior knowledge about CNNs. This post is in fact inspired by that paper and uses similar notations. Note: If you want to learn more about how CNNs can be used for Object Recognition, this post is for you.THE FIXED-SIZED CNN FEATURE MAP VISUALIZATION The receptive field is defined as the region in the input space that a particular CNN’s feature is looking at (i.e. be affected by) . A receptive field of a feature can be fully described by its center location and its size. Figure 1 shows some receptive field examples. By applying a convolution C with kernel size k = 3x3 , padding size p = 1x1 , stride s = 2x2 on an input map 5x5 , we will get an output feature map 3x3 (green map). Applying the same convolution on top of the 3x3 feature map, we will get a 2x2 feature map (orange map). The number of output features in each dimension can be calculated using the following formula, which is explained in detail in [ 1 ]. Note that in this post, to simplify things, I assume the CNN architecture to be symmetric, and the input image to be square. So both dimensions have the same values for all variables. If the CNN architecture or the input image is asymmetric, you can calculate the feature map attributes separately for each dimension. 
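The output-size formula referenced here appears to have been an image in the original post and does not survive in this text version, so here it is as a small Python helper, checked against the Figure 1 example that follows (a 5x5 input with k=3, p=1, s=2 gives 3 output features, and applying the same convolution again gives 2).

# The output-size relation from [1]: n_out = floor((n_in + 2*p - k) / s) + 1.
# Sanity-checked against the Figure 1 example below.
def num_output_features(n_in, k, p, s):
    return (n_in + 2 * p - k) // s + 1

assert num_output_features(5, k=3, p=1, s=2) == 3
assert num_output_features(3, k=3, p=1, s=2) == 2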
Figure 1: Two ways to visualize CNN feature maps. In all cases, we uses the convolution C with kernel size k = 3x3, padding size p = 1x1, stride s = 2x2. (Top row) Applying the convolution on a 5x5 input map to produce the 3x3 green feature map. (Bottom row) Applying the same convolution on top of the green feature map to produce the 2x2 orange feature map. (Left column) The common way to visualize a CNN feature map. Only looking at the feature map, we do not know where a feature is looking at (the center location of its receptive field) and how big is that region (its receptive field size). It will be impossible to keep track of the receptive field information in a deep CNN. (Right column) The fixed-sized CNN feature map visualization, where the size of each feature map is fixed, and the feature is located at the center of its receptive field.The left column of Figure 1 shows a common way to visualize a CNN feature map. In that visualization, although by looking at a feature map, we know how many features it contains. It is impossible to know where each feature is looking at (the center location of its receptive field) and how big is that region (its receptive field size). The right column of Figure 1 shows the fixed-sized CNN visualization, which solves the problem by keeping the size of all feature maps constant and equal to the input map. Each feature is then marked at the center of its receptive field location. Because all features in a feature map have the same receptive field size, we can simply draw a bounding box around one feature to represent its receptive field size. We don’t have to map this bounding box all the way down to the input layer since the feature map is already represented in the same size of the input layer. Figure 2 shows another example using the same convolution but applied on a bigger input map — 7x7. We can either plot the fixed-sized CNN feature maps in 3D (Left) or in 2D (Right). Notice that the size of the receptive field in Figure 2 escalates very quickly to the point that the receptive field of the center feature of the second feature layer covers almost the whole input map. This is an important insight which was used to improve the design of a deep CNN. Figure 2: Another fixed-sized CNN feature map representation. The same convolution C is applied on a bigger input map with i = 7x7. I drew the receptive field bounding box around the center feature and removed the padding grid for a clearer view. The fixed-sized CNN feature map can be presented in 3D (Left) or 2D (Right).RECEPTIVE FIELD ARITHMETIC To calculate the receptive field in each layer, besides the number of features n in each dimension, we need to keep track of some extra information for each layer. These include the current receptive field size r , the distance between two adjacent features (or jump) j, and the center coordinate of the upper left feature (the first feature) start . Note that the center coordinate of a feature is defined to be the center coordinate of its receptive field, as shown in the fixed-sized CNN feature map above. When applying a convolution with the kernel size k , the padding size p , and the stride size s , the attributes of the output layer can be calculated by the following equations: * The first equation calculates the number of output features based on the number of input features and the convolution properties. This is the same equation presented in [ 1 ]. 
* The second equation calculates the jump in the output feature map, which is the jump in the input map multiplied by the number of input features you jump over when applying the convolution (the stride size); in symbols, j_out = j_in * s. * The third equation calculates the receptive field size of the output feature map, which is the area covered by k input features, (k - 1) * j_in, plus the extra area covered by the receptive field of the input features on the border; in symbols, r_out = r_in + (k - 1) * j_in. * The fourth equation calculates the center position of the receptive field of the first output feature, which is the center position of the first input feature plus the distance from the first input feature to the center of the first convolution, (k - 1)/2 * j_in, minus the padding space p * j_in; in symbols, start_out = start_in + ((k - 1)/2 - p) * j_in. Note that we need to multiply by the jump of the input feature map in both cases to get the actual distance/space. The first layer is the input layer, which always has n = image size, r = 1, j = 1, and start = 0.5. Note that in Figure 3, I used a coordinate system in which the center of the first feature of the input layer is at 0.5. By applying the four equations above recursively, we can calculate the receptive field information for every feature map in a CNN. Figure 3 shows an example of how these equations work. Figure 3: Applying the receptive field calculation to the example given in Figure 1. The first row shows the notation and general equations, while the second and last rows show the process of applying them to calculate the receptive field of the output layer given the input layer information. I've also created a small Python program that calculates the receptive field information for all layers in a given CNN architecture. It also lets you input the name of any feature map and the index of a feature in that map, and returns the size and location of the corresponding receptive field. The following figure shows an output example when we use AlexNet. The code is provided at the end of this post.",The receptive field is perhaps one of the most important concepts in Convolutional Neural Networks (CNNs) that deserves more attention from the literature. This post will introduce a new way to visualize feature maps in a CNN that exposes the receptive field information.,A guide to receptive field arithmetic for Convolutional Neural Networks,Live,291 861,This video will help you to understand how Cloudant replication works. Visit http://www.cloudant.com/sign-up to sign up for a free Cloudant account.
Find more videos and tutorials in the Cloudant Learning Center: http://www.cloudant.com/learning-center,Understand how Cloudant database replication works,Understand how replication works,Live,292 863,"Skip to main content IBM developerWorks / Developer Centers Sign In | Register Cloud Data Services * Services * How-Tos * Blog * Events * ConnectOPEN DATA DAY, ECONOMIC JUSTICE, AND CIVIC ENGAGEMENTRaj R Singh / March 8, 2016I spent International Open Data Day at the NYC School of Data , New York City's civic technology an open data conference. It was aninspirational, battery-recharging experience that reminded me what's trulyimportant in life. Along the way, I learned many things: 1. In 2012, New York city passed the first sweeping open data law that switched the burden of information sharing from the public (think 1970s-era Freedom of Information Act policies), to the government. In short, NYC government departments are legally required to publish data online, for free, whenever possible! (I need to see how my local City of Boston open data policy compares…) 2. IBM has a Chief Data Strategist , Steven Adler. He's on the board of the NYCLU and got me involved in this event. Thanks Steve! Looking forward to moving the needle on data issues with you in the future. 3. Most importantly, I learned that the increasing availability of government open data sets around the country are providing powerful new ways for communities to engage on civic issues. Not only can we surface issues, we can also partner with government in operationalizing the monitoring and analysis of problems and solutions.As Jennifer Pahlka put it today, government needs to know whether policies are working in days ormonths, not decades . What an inspiring idea!Jennifer Pahlka presenting her work.One issue the group began to tackle, spurred by the NYCLU , is around economic justice. How can we tell if government policies areplaying out fairly in society and having the intended results? An example of apowerful data-driven story is that of "" million-dollar blocks ."" These are city blocks where states are spending in excess of a milliondollars a year to incarcerate their residents. Are you surprised million-dollarblocks exist? Is that a good way to spend public funds? Only by surfacing thesefacts with real data can we begin to have a truly informed public debate.Map of “million-dollar blocks” which show state incarceration spending byhousehold.If you're reading this, you're probably in the tech sector and doing pretty wellcompared to the rest of the world. A lot of that is luck. Your embryonic-cellself replicated and grew without mutation. Then you were born into a first-worldsociety, were well-nourished, and it was pretty easy for you to get a lot ofeducation without being interrupted by famine, drought, or war. Noteveryone—even in the US—is that lucky.So my message today is: give something back. Even if you only have an hour amonth, or a day a week, or just some cash, get involved. Join a local civichack, find a Code for America project , or update OpenStreetMap . Happy International Open Data Day!SHARE THIS: * Click to email this to a friend (Opens in new window) * Click to share on Twitter (Opens in new window) * Click to share on LinkedIn (Opens in new window) * Share on Facebook (Opens in new window) * Click to share on Reddit (Opens in new window) * Click to share on Pocket (Opens in new window) * Tagged: dashdb / geospatial / opendata / Python / R / Spark Please enable JavaScript to view the comments powered by Disqus. 
blog comments powered by Disqus * SUBSCRIBE TO BLOG UPDATES Enter your email address to subscribe to this blog and receive notifications of new posts by email. Email Address * CATEGORIES * Analytics * Cloudant * Community * Compose * CouchDB * dashDB * Data Warehousing * DB2 * Elasticsearch * Gaming * Hybrid * IoT * Location * Message Hub * Migration * Mobile * MongoDB * NoSQL * Offline * Open Data * PostgreSQL * Redis * Spark * SQL RSS Feed * Report Abuse * Terms of Use * Third Party Notice * IBM PrivacyIBM Send to Email Address Your Name Your Email Address Cancel Post was not sent - check your email addresses! Email check failed, please try again Sorry, your blog cannot share posts by email.",Exciting civic and open data projects discussed Saturday at NYC School of Data.,"Open Data Day, Economic Justice, and Civic Engagement",Live,293 866,"I've consulted with hundreds of people who use CouchDB, and the same sorts of questions keep coming up. Come to this talk if you want to know more about the kinds of mistakes many users make when thinking about how to use the database in their application. I'll talk a bit about the ""rough edges"" of CouchDB, and how to work them to your advantage.",Joan Touzet talks about ten common misconceptions about CouchDB and lends insight into best practices and design patterns.,10 Common Misconceptions about CouchDB,Live,294 877,"Skip to content * Features * Business * Explore * Marketplace * Pricing This repository Sign in or Sign up * Watch 1,180 * Star 11,226 * Fork 1,736 TERRYUM / AWESOME-DEEP-LEARNING-PAPERS Code Issues 6 Pull requests 1 Projects 0 Insights Pulse Graphs Permalink Branch: master Switch branches/tags * Branches * Tags master Nothing to show Nothing to show Find file Copy path awesome-deep-learning-papers / README.md a667046 Jun 28, 2017 terryum Update README.md 21 contributorsUSERS WHO HAVE CONTRIBUTED TO THIS FILE * terryum * miguelballesteros * Jeet1994 * jdoerrie * sunshinemyson * rtlee9 * flukeskywalker * pra85 * mbchang * mendelson * lserafin * ltrottier * liyaguang * lamblin * jeremyschlatter * rajikaimal * hosang * eddiepierce * dcastro9 * dan2k3k4 * bamos Raw Blame History 384 lines (320 sloc) 43.1 KBAWESOME - MOST CITED DEEP LEARNING PAPERS A curated list of the most cited deep learning papers (since 2012) We believe that there exist classic deep learning papers which are worth reading regardless of their application domain. Rather than providing overwhelming amount of papers, We would like to provide a curated list of the awesome deep learning papers which are considered as must-reads in certain research domains. BACKGROUND Before this list, there exist other awesome deep learning lists , for example, Deep Vision and Awesome Recurrent Neural Networks . Also, after this list comes out, another awesome list for deep learning beginners, called Deep Learning Papers Reading Roadmap , has been created and loved by many deep learning researchers. Although the Roadmap List includes lots of important deep learning papers, it feels overwhelming for me to read them all. As I mentioned in the introduction, I believe that seminal works can give us lessons regardless of their application domain. Thus, I would like to introduce top 100 deep learning papers here as a good starting point of overviewing deep learning researches. To get the news for newly released papers everyday, follow my twitter or facebook page ! AWESOME LIST CRITERIA 1. A list of top 100 deep learning papers published from 2012 to 2016 is suggested. 2. 
If a paper is added to the list, another paper (usually from *More Papers from 2016"" section) should be removed to keep top 100 papers. (Thus, removing papers is also important contributions as well as adding papers) 3. Papers that are important, but failed to be included in the list, will be listed in More than Top 100 section. 4. Please refer to New Papers and Old Papers sections for the papers published in recent 6 months or before 2012. (Citation criteria) * < 6 months : New Papers (by discussion) * 2016 : +60 citations or ""More Papers from 2016"" * 2015 : +200 citations * 2014 : +400 citations * 2013 : +600 citations * 2012 : +800 citations * ~2012 : Old Papers (by discussion) Please note that we prefer seminal deep learning papers that can be applied to various researches rather than application papers. For that reason, some papers that meet the criteria may not be accepted while others can be. It depends on the impact of the paper, applicability to other researches scarcity of the research domain, and so on. We need your contributions! If you have any suggestions (missing papers, new papers, key researchers or typos), please feel free to edit and pull a request. (Please read the contributing guide for further instructions, though just letting me know the title of papers can also be a big contribution to us.) (Update) You can download all top-100 papers with this and collect all authors' names with this . Also, bib file for all top-100 papers are available. Thanks, doodhwala, Sven and grepinsight ! * Can anyone contribute the code for obtaining the statistics of the authors of Top-100 papers? CONTENTS * Understanding / Generalization / Transfer * Optimization / Training Techniques * Unsupervised / Generative Models * Convolutional Network Models * Image Segmentation / Object Detection * Image / Video / Etc * Natural Language Processing / RNNs * Speech / Other Domain * Reinforcement Learning / Robotics * More Papers from 2016 (More than Top 100) * New Papers : Less than 6 months * Old Papers : Before 2012 * HW / SW / Dataset : Technical reports * Book / Survey / Review * Video Lectures / Tutorials / Blogs * Appendix: More than Top 100 : More papers not in the list -------------------------------------------------------------------------------- UNDERSTANDING / GENERALIZATION / TRANSFER * Distilling the knowledge in a neural network (2015), G. Hinton et al. [pdf] * Deep neural networks are easily fooled: High confidence predictions for unrecognizable images (2015), A. Nguyen et al. [pdf] * How transferable are features in deep neural networks? (2014), J. Yosinski et al. [pdf] * CNN features off-the-Shelf: An astounding baseline for recognition (2014), A. Razavian et al. [pdf] * Learning and transferring mid-Level image representations using convolutional neural networks (2014), M. Oquab et al. [pdf] * Visualizing and understanding convolutional networks (2014), M. Zeiler and R. Fergus [pdf] * Decaf: A deep convolutional activation feature for generic visual recognition (2014), J. Donahue et al. [pdf] OPTIMIZATION / TRAINING TECHNIQUES * Training very deep networks (2015), R. Srivastava et al. [pdf] * Batch normalization: Accelerating deep network training by reducing internal covariate shift (2015), S. Loffe and C. Szegedy [pdf] * Delving deep into rectifiers: Surpassing human-level performance on imagenet classification (2015), K. He et al. [pdf] * Dropout: A simple way to prevent neural networks from overfitting (2014), N. Srivastava et al. 
[pdf] * Adam: A method for stochastic optimization (2014), D. Kingma and J. Ba [pdf] * Improving neural networks by preventing co-adaptation of feature detectors (2012), G. Hinton et al. [pdf] * Random search for hyper-parameter optimization (2012) J. Bergstra and Y. Bengio [pdf] UNSUPERVISED / GENERATIVE MODELS * Pixel recurrent neural networks (2016), A. Oord et al. [pdf] * Improved techniques for training GANs (2016), T. Salimans et al. [pdf] * Unsupervised representation learning with deep convolutional generative adversarial networks (2015), A. Radford et al. [pdf] * DRAW: A recurrent neural network for image generation (2015), K. Gregor et al. [pdf] * Generative adversarial nets (2014), I. Goodfellow et al. [pdf] * Auto-encoding variational Bayes (2013), D. Kingma and M. Welling [pdf] * Building high-level features using large scale unsupervised learning (2013), Q. Le et al. [pdf] CONVOLUTIONAL NEURAL NETWORK MODELS * Rethinking the inception architecture for computer vision (2016), C. Szegedy et al. [pdf] * Inception-v4, inception-resnet and the impact of residual connections on learning (2016), C. Szegedy et al. [pdf] * Identity Mappings in Deep Residual Networks (2016), K. He et al. [pdf] * Deep residual learning for image recognition (2016), K. He et al. [pdf] * Spatial transformer network (2015), M. Jaderberg et al., [pdf] * Going deeper with convolutions (2015), C. Szegedy et al. [pdf] * Very deep convolutional networks for large-scale image recognition (2014), K. Simonyan and A. Zisserman [pdf] * Return of the devil in the details: delving deep into convolutional nets (2014), K. Chatfield et al. [pdf] * OverFeat: Integrated recognition, localization and detection using convolutional networks (2013), P. Sermanet et al. [pdf] * Maxout networks (2013), I. Goodfellow et al. [pdf] * Network in network (2013), M. Lin et al. [pdf] * ImageNet classification with deep convolutional neural networks (2012), A. Krizhevsky et al. [pdf] IMAGE: SEGMENTATION / OBJECT DETECTION * You only look once: Unified, real-time object detection (2016), J. Redmon et al. [pdf] * Fully convolutional networks for semantic segmentation (2015), J. Long et al. [pdf] * Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (2015), S. Ren et al. [pdf] * Fast R-CNN (2015), R. Girshick [pdf] * Rich feature hierarchies for accurate object detection and semantic segmentation (2014), R. Girshick et al. [pdf] * Spatial pyramid pooling in deep convolutional networks for visual recognition (2014), K. He et al. [pdf] * Semantic image segmentation with deep convolutional nets and fully connected CRFs , L. Chen et al. [pdf] * Learning hierarchical features for scene labeling (2013), C. Farabet et al. [pdf] IMAGE / VIDEO / ETC * Image Super-Resolution Using Deep Convolutional Networks (2016), C. Dong et al. [pdf] * A neural algorithm of artistic style (2015), L. Gatys et al. [pdf] * Deep visual-semantic alignments for generating image descriptions (2015), A. Karpathy and L. Fei-Fei [pdf] * Show, attend and tell: Neural image caption generation with visual attention (2015), K. Xu et al. [pdf] * Show and tell: A neural image caption generator (2015), O. Vinyals et al. [pdf] * Long-term recurrent convolutional networks for visual recognition and description (2015), J. Donahue et al. [pdf] * VQA: Visual question answering (2015), S. Antol et al. [pdf] * DeepFace: Closing the gap to human-level performance in face verification (2014), Y. Taigman et al. 
[pdf] : * Large-scale video classification with convolutional neural networks (2014), A. Karpathy et al. [pdf] * Two-stream convolutional networks for action recognition in videos (2014), K. Simonyan et al. [pdf] * 3D convolutional neural networks for human action recognition (2013), S. Ji et al. [pdf] NATURAL LANGUAGE PROCESSING / RNNS * Neural Architectures for Named Entity Recognition (2016), G. Lample et al. [pdf] * Exploring the limits of language modeling (2016), R. Jozefowicz et al. [pdf] * Teaching machines to read and comprehend (2015), K. Hermann et al. [pdf] * Effective approaches to attention-based neural machine translation (2015), M. Luong et al. [pdf] * Conditional random fields as recurrent neural networks (2015), S. Zheng and S. Jayasumana. [pdf] * Memory networks (2014), J. Weston et al. [pdf] * Neural turing machines (2014), A. Graves et al. [pdf] * Neural machine translation by jointly learning to align and translate (2014), D. Bahdanau et al. [pdf] * Sequence to sequence learning with neural networks (2014), I. Sutskever et al. [pdf] * Learning phrase representations using RNN encoder-decoder for statistical machine translation (2014), K. Cho et al. [pdf] * A convolutional neural network for modeling sentences (2014), N. Kalchbrenner et al. [pdf] * Convolutional neural networks for sentence classification (2014), Y. Kim [pdf] * Glove: Global vectors for word representation (2014), J. Pennington et al. [pdf] * Distributed representations of sentences and documents (2014), Q. Le and T. Mikolov [pdf] * Distributed representations of words and phrases and their compositionality (2013), T. Mikolov et al. [pdf] * Efficient estimation of word representations in vector space (2013), T. Mikolov et al. [pdf] * Recursive deep models for semantic compositionality over a sentiment treebank (2013), R. Socher et al. [pdf] * Generating sequences with recurrent neural networks (2013), A. Graves. [pdf] SPEECH / OTHER DOMAIN * End-to-end attention-based large vocabulary speech recognition (2016), D. Bahdanau et al. [pdf] * Deep speech 2: End-to-end speech recognition in English and Mandarin (2015), D. Amodei et al. [pdf] * Speech recognition with deep recurrent neural networks (2013), A. Graves [pdf] * Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups (2012), G. Hinton et al. [pdf] * Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition (2012) G. Dahl et al. [pdf] * Acoustic modeling using deep belief networks (2012), A. Mohamed et al. [pdf] REINFORCEMENT LEARNING / ROBOTICS * End-to-end training of deep visuomotor policies (2016), S. Levine et al. [pdf] * Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection (2016), S. Levine et al. [pdf] * Asynchronous methods for deep reinforcement learning (2016), V. Mnih et al. [pdf] * Deep Reinforcement Learning with Double Q-Learning (2016), H. Hasselt et al. [pdf] * Mastering the game of Go with deep neural networks and tree search (2016), D. Silver et al. [pdf] * Continuous control with deep reinforcement learning (2015), T. Lillicrap et al. [pdf] * Human-level control through deep reinforcement learning (2015), V. Mnih et al. [pdf] * Deep learning for detecting robotic grasps (2015), I. Lenz et al. [pdf] * Playing atari with deep reinforcement learning (2013), V. Mnih et al. [pdf] ) MORE PAPERS FROM 2016 * Layer Normalization (2016), J. Ba et al. 
[pdf] * Learning to learn by gradient descent by gradient descent (2016), M. Andrychowicz et al. [pdf] * Domain-adversarial training of neural networks (2016), Y. Ganin et al. [pdf] * WaveNet: A Generative Model for Raw Audio (2016), A. Oord et al. [pdf] [web] * Colorful image colorization (2016), R. Zhang et al. [pdf] * Generative visual manipulation on the natural image manifold (2016), J. Zhu et al. [pdf] * Texture networks: Feed-forward synthesis of textures and stylized images (2016), D Ulyanov et al. [pdf] * SSD: Single shot multibox detector (2016), W. Liu et al. [pdf] * SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and< 1MB model size (2016), F. Iandola et al. [pdf] * Eie: Efficient inference engine on compressed deep neural network (2016), S. Han et al. [pdf] * Binarized neural networks: Training deep neural networks with weights and activations constrained to+ 1 or-1 (2016), M. Courbariaux et al. [pdf] * Dynamic memory networks for visual and textual question answering (2016), C. Xiong et al. [pdf] * Stacked attention networks for image question answering (2016), Z. Yang et al. [pdf] * Hybrid computing using a neural network with dynamic external memory (2016), A. Graves et al. [pdf] * Google's neural machine translation system: Bridging the gap between human and machine translation (2016), Y. Wu et al. [pdf] -------------------------------------------------------------------------------- NEW PAPERS Newly published papers (< 6 months) which are worth reading * Accurate, Large Minibatch SGD:Training ImageNet in 1 Hour (2017), Priya Goyal et al. [pdf] * TACOTRON: Towards end-to-end speech synthesis (2017), Y. Wang et al. [pdf] * Deep Photo Style Transfer (2017), F. Luan et al. [pdf] * Evolution Strategies as a Scalable Alternative to Reinforcement Learning (2017), T. Salimans et al. [pdf] * Deformable Convolutional Networks (2017), J. Dai et al. [pdf] * Mask R-CNN (2017), K. He et al. [pdf] * Learning to discover cross-domain relations with generative adversarial networks (2017), T. Kim et al. [pdf] * Deep voice: Real-time neural text-to-speech (2017), S. Arik et al., [pdf] * PixelNet: Representation of the pixels, by the pixels, and for the pixels (2017), A. Bansal et al. [pdf] * Batch renormalization: Towards reducing minibatch dependence in batch-normalized models (2017), S. Ioffe. [pdf] * Wasserstein GAN (2017), M. Arjovsky et al. [pdf] * Understanding deep learning requires rethinking generalization (2017), C. Zhang et al. [pdf] * Least squares generative adversarial networks (2016), X. Mao et al. [pdf] OLD PAPERS Classic papers published before 2012 * An analysis of single-layer networks in unsupervised feature learning (2011), A. Coates et al. [pdf] * Deep sparse rectifier neural networks (2011), X. Glorot et al. [pdf] * Natural language processing (almost) from scratch (2011), R. Collobert et al. [pdf] * Recurrent neural network based language model (2010), T. Mikolov et al. [pdf] * Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion (2010), P. Vincent et al. [pdf] * Learning mid-level features for recognition (2010), Y. Boureau [pdf] * A practical guide to training restricted boltzmann machines (2010), G. Hinton [pdf] * Understanding the difficulty of training deep feedforward neural networks (2010), X. Glorot and Y. Bengio [pdf] * Why does unsupervised pre-training help deep learning (2010), D. Erhan et al. [pdf] * Learning deep architectures for AI (2009), Y. Bengio. 
[pdf] * Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations (2009), H. Lee et al. [pdf] * Greedy layer-wise training of deep networks (2007), Y. Bengio et al. [pdf] * Reducing the dimensionality of data with neural networks, G. Hinton and R. Salakhutdinov. [pdf] * A fast learning algorithm for deep belief nets (2006), G. Hinton et al. [pdf] * Gradient-based learning applied to document recognition (1998), Y. LeCun et al. [pdf] * Long short-term memory (1997), S. Hochreiter and J. Schmidhuber. [pdf] HW / SW / DATASET * OpenAI gym (2016), G. Brockman et al. [pdf] * TensorFlow: Large-scale machine learning on heterogeneous distributed systems (2016), M. Abadi et al. [pdf] * Theano: A Python framework for fast computation of mathematical expressions, R. Al-Rfou et al. * Torch7: A matlab-like environment for machine learning, R. Collobert et al. [pdf] * MatConvNet: Convolutional neural networks for matlab (2015), A. Vedaldi and K. Lenc [pdf] * Imagenet large scale visual recognition challenge (2015), O. Russakovsky et al. [pdf] * Caffe: Convolutional architecture for fast feature embedding (2014), Y. Jia et al. [pdf] BOOK / SURVEY / REVIEW * On the Origin of Deep Learning (2017), H. Wang and Bhiksha Raj. [pdf] * Deep Reinforcement Learning: An Overview (2017), Y. Li, [pdf] * Neural Machine Translation and Sequence-to-sequence Models(2017): A Tutorial, G. Neubig. [pdf] * Neural Network and Deep Learning (Book, Jan 2017), Michael Nielsen. [html] * Deep learning (Book, 2016), Goodfellow et al. [html] * LSTM: A search space odyssey (2016), K. Greff et al. [pdf] * Tutorial on Variational Autoencoders (2016), C. Doersch. [pdf] * Deep learning (2015), Y. LeCun, Y. Bengio and G. Hinton [pdf] * Deep learning in neural networks: An overview (2015), J. Schmidhuber [pdf] * Representation learning: A review and new perspectives (2013), Y. Bengio et al. [pdf] VIDEO LECTURES / TUTORIALS / BLOGS (Lectures) * CS231n, Convolutional Neural Networks for Visual Recognition, Stanford University [web] * CS224d, Deep Learning for Natural Language Processing, Stanford University [web] * Oxford Deep NLP 2017, Deep Learning for Natural Language Processing, University of Oxford [web] (Tutorials) * NIPS 2016 Tutorials, Long Beach [web] * ICML 2016 Tutorials, New York City [web] * ICLR 2016 Videos, San Juan [web] * Deep Learning Summer School 2016, Montreal [web] * Bay Area Deep Learning School 2016, Stanford [web] (Blogs) * OpenAI [web] * Distill [web] * Andrej Karpathy Blog [web] * Colah's Blog [Web] * WildML [Web] * FastML [web] * TheMorningPaper [web] APPENDIX: MORE THAN TOP 100 (2016) * A character-level decoder without explicit segmentation for neural machine translation (2016), J. Chung et al. [pdf] * Dermatologist-level classification of skin cancer with deep neural networks (2017), A. Esteva et al. [html] * Weakly supervised object localization with multi-fold multiple instance learning (2017), R. Gokberk et al. [pdf] * Brain tumor segmentation with deep neural networks (2017), M. Havaei et al. [pdf] * Professor Forcing: A New Algorithm for Training Recurrent Networks (2016), A. Lamb et al. [pdf] * Adversarially learned inference (2016), V. Dumoulin et al. [web] [pdf] * Understanding convolutional neural networks (2016), J. Koushik [pdf] * Taking the human out of the loop: A review of bayesian optimization (2016), B. Shahriari et al. [pdf] * Adaptive computation time for recurrent neural networks (2016), A. 
Graves [pdf] * Densely connected convolutional networks (2016), G. Huang et al. [pdf] * Region-based convolutional networks for accurate object detection and segmentation (2016), R. Girshick et al. * Continuous deep q-learning with model-based acceleration (2016), S. Gu et al. [pdf] * A thorough examination of the cnn/daily mail reading comprehension task (2016), D. Chen et al. [pdf] * Achieving open vocabulary neural machine translation with hybrid word-character models, M. Luong and C. Manning. [pdf] * Very Deep Convolutional Networks for Natural Language Processing (2016), A. Conneau et al. [pdf] * Bag of tricks for efficient text classification (2016), A. Joulin et al. [pdf] * Efficient piecewise training of deep structured models for semantic segmentation (2016), G. Lin et al. [pdf] * Learning to compose neural networks for question answering (2016), J. Andreas et al. [pdf] * Perceptual losses for real-time style transfer and super-resolution (2016), J. Johnson et al. [pdf] * Reading text in the wild with convolutional neural networks (2016), M. Jaderberg et al. [pdf] * What makes for effective detection proposals? (2016), J. Hosang et al. [pdf] * Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks (2016), S. Bell et al. [pdf] . * Instance-aware semantic segmentation via multi-task network cascades (2016), J. Dai et al. [pdf] * Conditional image generation with pixelcnn decoders (2016), A. van den Oord et al. [pdf] * Deep networks with stochastic depth (2016), G. Huang et al., [pdf] * Consistency and Fluctuations For Stochastic Gradient Langevin Dynamics (2016), Yee Whye Teh et al. [pdf] (2015) * Ask your neurons: A neural-based approach to answering questions about images (2015), M. Malinowski et al. [pdf] * Exploring models and data for image question answering (2015), M. Ren et al. [pdf] * Are you talking to a machine? dataset and methods for multilingual image question (2015), H. Gao et al. [pdf] * Mind's eye: A recurrent visual representation for image caption generation (2015), X. Chen and C. Zitnick. [pdf] * From captions to visual concepts and back (2015), H. Fang et al. [pdf] . * Towards AI-complete question answering: A set of prerequisite toy tasks (2015), J. Weston et al. [pdf] * Ask me anything: Dynamic memory networks for natural language processing (2015), A. Kumar et al. [pdf] * Unsupervised learning of video representations using LSTMs (2015), N. Srivastava et al. [pdf] * Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding (2015), S. Han et al. [pdf] * Improved semantic representations from tree-structured long short-term memory networks (2015), K. Tai et al. [pdf] * Character-aware neural language models (2015), Y. Kim et al. [pdf] * Grammar as a foreign language (2015), O. Vinyals et al. [pdf] * Trust Region Policy Optimization (2015), J. Schulman et al. [pdf] * Beyond short snippents: Deep networks for video classification (2015) [pdf] * Learning Deconvolution Network for Semantic Segmentation (2015), H. Noh et al. [pdf] * Learning spatiotemporal features with 3d convolutional networks (2015), D. Tran et al. [pdf] * Understanding neural networks through deep visualization (2015), J. Yosinski et al. [pdf] * An Empirical Exploration of Recurrent Network Architectures (2015), R. Jozefowicz et al. [pdf] * Deep generative image models using a laplacian pyramid of adversarial networks (2015), E.Denton et al. [pdf] * Gated Feedback Recurrent Neural Networks (2015), J. 
Chung et al. [pdf] * Fast and accurate deep network learning by exponential linear units (ELUS) (2015), D. Clevert et al. [pdf] * Pointer networks (2015), O. Vinyals et al. [pdf] * Visualizing and Understanding Recurrent Networks (2015), A. Karpathy et al. [pdf] * Attention-based models for speech recognition (2015), J. Chorowski et al. [pdf] * End-to-end memory networks (2015), S. Sukbaatar et al. [pdf] * Describing videos by exploiting temporal structure (2015), L. Yao et al. [pdf] * A neural conversational model (2015), O. Vinyals and Q. Le. [pdf] * Improving distributional similarity with lessons learned from word embeddings, O. Levy et al. [[pdf]] ( https://www.transacl.org/ojs/index.php/tacl/article/download/570/124 ) * Transition-Based Dependency Parsing with Stack Long Short-Term Memory (2015), C. Dyer et al. [pdf] * Improved Transition-Based Parsing by Modeling Characters instead of Words with LSTMs (2015), M. Ballesteros et al. [pdf] * Finding function in form: Compositional character models for open vocabulary word representation (2015), W. Ling et al. [pdf] (~2014) * DeepPose: Human pose estimation via deep neural networks (2014), A. Toshev and C. Szegedy [pdf] * Learning a Deep Convolutional Network for Image Super-Resolution (2014, C. Dong et al. [pdf] * Recurrent models of visual attention (2014), V. Mnih et al. [pdf] * Empirical evaluation of gated recurrent neural networks on sequence modeling (2014), J. Chung et al. [pdf] * Addressing the rare word problem in neural machine translation (2014), M. Luong et al. [pdf] * On the properties of neural machine translation: Encoder-decoder approaches (2014), K. Cho et. al. * Recurrent neural network regularization (2014), W. Zaremba et al. [pdf] * Intriguing properties of neural networks (2014), C. Szegedy et al. [pdf] * Towards end-to-end speech recognition with recurrent neural networks (2014), A. Graves and N. Jaitly. [pdf] * Scalable object detection using deep neural networks (2014), D. Erhan et al. [pdf] * On the importance of initialization and momentum in deep learning (2013), I. Sutskever et al. [pdf] * Regularization of neural networks using dropconnect (2013), L. Wan et al. [pdf] * Learning Hierarchical Features for Scene Labeling (2013), C. Farabet et al. [pdf] * Linguistic Regularities in Continuous Space Word Representations (2013), T. Mikolov et al. [pdf] * Large scale distributed deep networks (2012), J. Dean et al. [pdf] * A Fast and Accurate Dependency Parser using Neural Networks. Chen and Manning. [pdf] ACKNOWLEDGEMENT Thank you for all your contributions. Please make sure to read the contributing guide before you make a pull request. LICENSE To the extent possible under law, Terry T. Um has waived all copyright and related or neighboring rights to this work. Jump to Line Go * Contact GitHub * API * Training * Shop * Blog * About * © 2017 GitHub , Inc. * Terms * Privacy * Security * Status * Help You can't perform that action at this time. You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.",A curated list of the most cited deep learning papers (since 2012). ,Awesome deep learning papers,Live,295 878,"Compose The Compose logo Articles Sign in Free 30-day trialMAKING OF A SMART BUSINESS CHATBOT: PART 3 Published Aug 24, 2017 Making of a Smart Business Chatbot: Part 3 janusgraph watson conversation Free 30 Day TrialChatbots are a great way to interact with your customers in real-time and gain insights into your users. 
In this third part of the series on building smart business chatbots, we’ll use a JanusGraph-backed knowledge base to give our chatbot from part 1 and part 2 some utility. We’ve reached the third part of our Building Smart Business Chatbots and now we’re going to use JanusGraph to give our bot the knowledge to go with the chat. We’ll use Watson Conversation to allow our users to search for articles that might match their interests and responds back in conversational form. Let’s get started... GRAPHING IT UP It always helps to have some data before we start coding things up, so let’s start by inputting some articles from the Compose blog into a new JanusGraph database. Follow the first few steps in our article on Markov Chains to spin up a JanusGraph Instance on Compose and get Gremlin up-and-running. Once you have those going, we’ll create a new database for our articles: gremlin> :> def graph = ConfiguredGraphFactory.create(""composeblog"") ==>standardjanusgraph[astyanax:[10.189.87.4, 10.189.87.3, 10.189.87.2]] gremlin> :> graph.tx().commit() Next, we’ll grab a few articles with various tags and topics from a few different authors. We’ll model them by using vertices for our authors, tags, and articles. We’ll use edges to represent the relationships between those vertices: Let’s go ahead and start building out our graph. We’ll fill in 10 articles from the blog across 3 different authors. First, let’s add the authors: gremlin> :> graph.tx().commit() ==>null gremlin> :> def john = graph.addVertex(T.label, ""person"", ""name"", ""John O'Connor"") ==>v[4112] gremlin> :> def abdullah = graph.addVertex(T.label, ""person"", ""name"", ""Abdullah Alger"") ==>v[8208] gremlin> :> def dj = graph.addVertex(T.label, ""person"", ""name"", ""DJ Walker-Morgan"") ==>v[4208] gremlin> :> graph.tx().commit() ==>null Next, we’ll add some tags from a sampling of articles. 
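Before doing that, a quick sanity check (this check is my addition, not part of the original walkthrough, and it assumes the session above is still live; if it has timed out, you can reopen the graph and bind a traversal source as shown):

gremlin> :> def g = ConfiguredGraphFactory.open(""composeblog"").traversal()
gremlin> :> g.V().hasLabel(""person"").values(""name"")
==>John O'Connor
==>Abdullah Alger
==>DJ Walker-Morgan

All three authors should come back, although the order may differ.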
We’ll use the following sampling of articles to give us a good starting point, and we’ll pull the tags directly from those articles: * Taking a Look at Robomongo and Studio 3T with Compose for MongoDB * Avoid Storing Data Inside ""Admin"" When Using MongoDB * Storing Network Addresses using PostgreSQL * Mastering PostgreSQL Tools: Full-Text Search and Phrase Search * How to Script Painless-ly in Elasticsearch * MQTT and STOMP for Compose RabbitMQ * Elasticsearch 5.4.2 comes to Compose * Compose PostgreSQL powers up to 9.6 * Introduction to Graph Databases * Easier Java connections to MongoDB at Compose * Graph 101: Magical Markov Chains * Building Secure Instant API's with RESTHeart and Compose * Compose Tips: Dates and Dating in MongoDB * 5-minute Signup Forms with Node-RED and Compose * Mongo Metrics: Calculating the Mode * Building Secure Distributed Javascript Microservices with RabbitMQ and SenecaJS Let’s go through each of these articles and extract the relevant tags: gremlin> :> def mongodb = graph.addVertex(T.label, ""tag"", ""name"", ""mongodb"") ==>v[8304] gremlin> :> def janusgraph = graph.addVertex(T.label, ""tag"", ""name"", ""janusgraph"") ==>v[4232] gremlin> :> def nodeRed = graph.addVertex(T.label, ""tag"", ""name"", ""node-red"") ==>v[4304] gremlin> :> def nodejs = graph.addVertex(T.label, ""tag"", ""name"", ""nodejs"") ==>v[8328] gremlin> :> def rabbitmq = graph.addVertex(T.label, ""tag"", ""name"", ""rabbitmq"") ==>v[4152] gremlin> :> def elasticsearch = graph.addVertex(T.label, ""tag"", ""name"", ""elasticsearch"") ==>v[4184] gremlin> :> def postgres = graph.addVertex(T.label, ""tag"", ""name"", ""postgres"") ==>v[8280] gremlin> :> graph.tx().commit() ==>null Now that we have our tags, we can input our articles along with their relationship between tags and authors. 
gremlin> :> graph.addVertex(T.label, ""article"", ""name"", ""Taking a Look at Robomongo and Studio 3T with Compose for MongoDB"", “url”, “https://www.compose.com/articles/taking-a-look-at-robomongo-and-studio-3t-with-compose-for-mongodb/”) gremlin> :> graph.addVertex(T.label, ""article"", ""name"", ""Avoid Storing Data Inside ""Admin"" When Using MongoDB"", “url”, “https://www.compose.com/articles/avoid-storing-data-inside-admin-when-using-mongodb/”) gremlin> :> graph.addVertex(T.label, ""article"", ""name"", ""Storing Network Addresses using PostgreSQL"", “url”, “https://www.compose.com/articles/storing-network-addresses-using-postgresql/”) gremlin> :> graph.addVertex(T.label, ""article"", ""name"", ""Mastering PostgreSQL Tools: Full-Text Search and Phrase Search"", “url”, “https://www.compose.com/articles/mastering-postgresql-tools-full-text-search-and-phrase-search/”) gremlin> :> graph.addVertex(T.label, ""article"", ""name"", ""How to Script Painless-ly in Elasticsearch"", “url”, “https://www.compose.com/articles/how-to-script-painless-ly-in-elasticsearch/”) gremlin> :> graph.addVertex(T.label, ""article"", ""name"", ""MQTT and STOMP for Compose RabbitMQ"", “url”, “https://www.compose.com/articles/mqtt-and-stomp-for-compose-rabbitmq”) gremlin> :> graph.addVertex(T.label, ""article"", ""name"", ""Elasticsearch 5.4.2 comes to Compose"", “url”, “https://www.compose.com/articles/elasticsearch-5-4-2-comes-to-compose”) gremlin> :> graph.addVertex(T.label, ""article"", ""name"", ""Compose PostgreSQL powers up to 9.6"", “url”, “https://www.compose.com/articles/compose-postgresql-powers-up-to-9-6/”) gremlin> :> graph.addVertex(T.label, ""article"", ""name"", ""Introduction to Graph Databases"", “url”, “https://www.compose.com/articles/introduction-to-graph-databases/”) gremlin> :> graph.addVertex(T.label, ""article"", ""name"", ""Easier Java connections to MongoDB at Compose"", “url”, “https://www.compose.com/articles/easier-java-connections-to-mongodb-at-compose-2/”) gremlin> :> graph.addVertex(T.label, ""article"", ""name"", ""Graph 101: Magical Markov Chains"", “url”, “https://www.compose.com/articles/graph-101-magical-markov-chains/”) gremlin> :> graph.addVertex(T.label, ""article"", ""name"", ""Building Secure Instant API's with RESTHeart and Compose"", “url”, “https://www.compose.com/articles/building-secure-instant-apis-with-restheart-and-compose/”) gremlin> :> graph.addVertex(T.label, ""article"", ""name"", ""Compose Tips: Dates and Dating in MongoDB"", “url”, “https://www.compose.com/articles/understanding-dates-in-compose-mongodb/”) gremlin> :> graph.addVertex(T.label, ""article"", ""name"", ""5-minute Signup Forms with Node-RED and Compose"", “url”, “https://www.compose.com/articles/5-minute-signup-with-node-red-and-compose/”) gremlin> :> graph.addVertex(T.label, ""article"", ""name"", ""Mongo Metrics: Calculating the Mode"", “url”, “https://www.compose.com/articles/mongo-metrics-calculating-the-mode/”) gremlin> :> graph.addVertex(T.label, ""article"", ""name"", ""Building Secure Distributed Javascript Microservices with RabbitMQ and SenecaJS"", “url”, “https://www.compose.com/articles/building-secure-distributed-javascript-microservices-with-rabbitmq-and-senecajs/”) gremlin> :> graph.tx().commit() Finally, let's add edges between our articles, authors, and tags so our graph is complete. At this point, you are entering quite a bit of data and, if you're using JanusGraph on Compose, your session might have timed out. 
Rather than using variable names to add edges to vertices like we did in the previous article , you can access them directly through the traversal object: gremlin> : ==>v[4112] Where the number inside of g.V() is the ID of the vertex. If you're not sure what the ID is of the vertex you're looking for, you can use the valueMap() method to figure it out: gremlin> : ==>{name=[Abdullah Alger], id=8208, label=person} ==>{name=[John O'Connor], id=4112, label=person} ==>{name=[DJ Walker-Morgan], id=4208, label=person} First, we'll add edges between each of our articles and authors. Since we may have disconnected by now, we'll use the id of each article to add the edges. We can find that by using the following command: gremlin> : ==>{label=article, id=45168, name=[Building Secure Distributed Javascript Microservices with RabbitMQ and SenecaJS], url=[https://www.compose.com/articles/building-secure-distributed-javascript-microservices-with-rabbitmq-and-senecajs/]} ==>{label=article, id=12496, name=[MQTT and STOMP for Compose RabbitMQ], url=[https://www.compose.com/articles/mqtt-and-stomp-for-compose-rabbitmq]} ... Use the command above to find the id s of your articles, and remember to change the ID in the g.V() command below with the id of those article you want to add an edge to: gremlin> : ==>v[4112] gremlin> :> g.V(36976).next().addEdge(""author"", john) ==>e[he6-sj4-5jp-368][36976-author->4112] gremlin> :> g.V(41072).next().addEdge(""author"", john) ==>e[hse-vow-5jp-368][41072-author->4112] gremlin> :> g.V(4120).next().addEdge(""author"", john) ==>e[1z7-36g-5jp-368][4120-author->4112] gremlin> :> g.V(16472).next().addEdge(""author"", john) ==>e[7ij-cpk-5jp-368][16472-author->4112] gremlin> :> g.V(20568).next().addEdge(""author"", john) ==>e[7wr-fvc-5jp-368][20568-author->4112] gremlin> :> def abdullah = g.V(8208).next() ==>v[8208] gremlin> :> g.V(12496).next().addEdge(""author"", abdullah) ==>e[4re-9n4-5jp-6c0][12496-author->8208] gremlin> :> g.V(12400).next().addEdge(""author"", abdullah) ==>e[i6m-9kg-5jp-6c0][12400-author->8208] gremlin> :> g.V(24688).next().addEdge(""author"", abdullah) ==>e[iku-j1s-5jp-6c0][24688-author->8208] gremlin> :> g.V(16496).next().addEdge(""author"", abdullah) ==>e[iz2-cq8-5jp-6c0][16496-author->8208] gremlin> :> g.V(20592).next().addEdge(""author"", abdullah) ==>e[jda-fw0-5jp-6c0][20592-author->8208] gremlin> :> g.V(8400).next().addEdge(""author"", abdullah) ==>e[55m-6hc-5jp-6c0][8400-author->8208] gremlin> : ==>v[4208] gremlin> :> g.V(12496).next().addEdge(""author"", dj) ==>e[5ju-9n4-5jp-38w][12496-author->4208] gremlin> :> g.V(32880).next().addEdge(""author"", dj) ==>e[jri-pdc-5jp-38w][32880-author->4208] gremlin> :> g.V(28784).next().addEdge(""author"", dj) ==>e[k5q-m7k-5jp-38w][28784-author->4208] gremlin> :> g.V(12376).next().addEdge(""author"", dj) ==>e[8az-9js-5jp-38w][12376-author->4208] gremlin> :> g.V(12424).next().addEdge(""author"", dj) ==>e[4cx-9l4-5jp-38w][12424-author->4208] gremlin> :> graph.tx().commit() ==>null Now, we'll add an edge for each of our topics. 
gremlin> :> g.V(45168).next().addEdge(""topic"", rabbit) ==>e[odxce-yuo-28lx-37c][45168-topic->4152] gremlin> :> g.V(45168).next().addEdge(""topic"", nodejs) ==>e[odxqm-yuo-28lx-6fc][45168-topic->8328] gremlin> :> g.V(12496).next().addEdge(""topic"", rabbit) ==>e[odxcq-9n4-28lx-37c][12496-topic->4152] gremlin> :> g.V(12400).next().addEdge(""topic"", mongodb) ==>e[ody4u-9kg-28lx-6eo][12400-topic->8304] gremlin> :> g.V(36976).next().addEdge(""topic"", janus) ==>e[odyj2-sj4-28lx-39k][36976-topic->4232] gremlin> :> g.V(32880).next().addEdge(""topic"", mongodb) ==>e[odyxa-pdc-28lx-6eo][32880-topic->8304] gremlin> :> g.V(41072).next().addEdge(""topic"", mongodb) ==>e[odzbi-vow-28lx-6eo][41072-topic->8304] gremlin> :> g.V(24688).next().addEdge(""topic"", elastic) ==>e[odzpq-j1s-28lx-388][24688-topic->4184] gremlin> :> g.V(4120).next().addEdge(""topic"", mongodb) ==>e[odxc3-36g-28lx-6eo][4120-topic->8304] gremlin> :> g.V(28784).next().addEdge(""topic"", janus) ==>e[oe03y-m7k-28lx-39k][28784-topic->4232] gremlin> :> g.V(16496).next().addEdge(""topic"", postgres) ==>e[oe0i6-cq8-28lx-6e0][16496-topic->8280] gremlin> :> g.V(12376).next().addEdge(""topic"", postgres) ==>e[odxcb-9js-28lx-6e0][12376-topic->8280] gremlin> :> g.V(16472).next().addEdge(""topic"", mongodb) ==>e[odxqj-cpk-28lx-6eo][16472-topic->8304] gremlin> :> g.V(20568).next().addEdge(""topic"", nodered) ==>e[ody4r-fvc-28lx-3bk][20568-topic->4304] gremlin> :> g.V(20568).next().addEdge(""topic"", mongodb) ==>e[odyiz-fvc-28lx-6eo][20568-topic->8304] gremlin> :> g.V(12424).next().addEdge(""topic"", elastic) ==>e[odxch-9l4-28lx-388][12424-topic->4184] gremlin> :> g.V(20592).next().addEdge(""topic"", postgres) ==>e[oe0we-fw0-28lx-6e0][20592-topic->8280] gremlin> :> g.V(8400).next().addEdge(""topic"", mongodb) ==>e[odxqy-6hc-28lx-6eo][8400-topic->8304] gremlin> :> graph.tx().commit() ==>null If you're paying close attention, you'll notice that I actually doubled-up on some of those topics. One of the most useful things about graph databases is the ability to model relationships as you discover them, rather than having to plan out these relationships ahead of time (as you would with a relational database). We're able to connect multiple topics to the same article simply by adding another edge to the article node. Now that we have our graph put together, let's run a quick test by querying JanusGraph for all of the articles written by Abdullah: gremlin> :> g.V(abdullah).in(""author"").values(""name"") ==>Avoid Storing Data Inside 'Admin' When Using MongoDB ==>Taking a Look at Robomongo and Studio 3T with Compose for MongoDB ==>MQTT and STOMP for Compose RabbitMQ ==>Storing Network Addresses using PostgreSQL ==>Mastering PostgreSQL Tools: Full-Text Search and Phrase Search ==>How to Script Painless-ly in Elasticsearch And for fun, let's see all of the articles with a topic of mongodb : gremlin> :> def mongo = g.V().has(""name"", ""mongodb"").next() ==>v[8304] gremlin> :> g.V(mongo).in(""topic"").values(""name"") ==>Compose Tips: Dates and Dating in MongoDB ==>Avoid Storing Data Inside 'Admin' When Using MongoDB ==>Taking a Look at Robomongo and Studio 3T with Compose for MongoDB ==>Building Secure Instant API's with RESTHeart and Compose ==>5-minute Signup Forms with Node-RED and Compose ==>Easier Java connections to MongoDB at Compose ==>Mongo Metrics: Calculating the Mode That looks about right - we can now ask JanusGraph to find all of the articles written on a particular topic or by a particular author. 
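These traversals also compose, which is closer to what the chatbot will eventually need. As a sketch (mine, not from the original article, but using the same labels and property names we created above), here is a query for articles by a given author on a given topic:

gremlin> :> g.V().has(""name"", ""Abdullah Alger"").in(""author"").where(__.out(""topic"").has(""name"", ""postgres"")).values(""name"")

The traversal starts at the author vertex, walks back along the author edges to that author's articles, and keeps only the articles that also have a topic edge to the postgres tag, so it should come back with the PostgreSQL articles we attributed to that author above (the exact hits depend on how you wired your edges).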
Now, let's see how we can bring these together by connecting JanusGraph up with our Node-RED application. CONNECTING TO JANUSGRAPH FROM NODE-RED We've been building our chatbot with Node-RED hosted on Bluemix, and now it's time to connect our JanusGraph instance to it. The JanusGraph HTTP API can be used to execute gremlin queries using HTTP, so we'll try this out by using the HTTP Request node in Node-RED. JanusGraph exposes a single HTTP POST endpoint to execute Gremlin queries. The endpoint expects a JSON-formatted document with a single key (gremlin) that has the value of your Gremlin query: { ""gremlin"": ""YOUR_GREMLIN_QUERY_HERE"" } This API is stateless which means that, unlike using Gremlin from the command line, we won't be able to use variables across commands. We'll also need to open the graph each time we want to use it (remember, the graph and g we used previously won't be available to us. Connecting to the API is a two-step process: first, we'll need a session token we can use to authenticate our web calls. These tokens have a timeout of 60 minutes, so we'll need to refresh the tokens periodically. Once we have the token, we'll be able to send requests to JanusGraph with the token in the header of our call. GENERATING A SESSION TOKEN First, we'll need to generate the session token. Let's start by just using a simple inject node to test our session token web call. Drag an inject node, an http request node, and a debug node onto the canvas. Double-click the http request node and give it a name of JG Auth . Wire them all up so they look like the following: Then, double-click the JG Auth node to configure it with a method of GET and a URL using the connection string from the Gremlin using Token Authentication section of the Compose dashboard: Wire them up, click deploy , and click on the button next to the inject node. You should see something like this in the debug panel: {""token"": """"} That's the session token you can now use to make requests to your JanusGraph instance. Now, let's send a request using that token. Drag another inject node, http request , and debug node onto the canvas, and this time drag a function node onto the canvas as well. Double click each of them to name them, giving the http request node a name of JG Request and the function node a name of JG Query . Then, wire them up like the following: Double-click the JG Query function node so we can add the token to the msg.header object and the query to our msg.payload object. We'll also configure our msg.url and msg.method here so we don't have to open the JG Request node, and we'll hard-code the token for now: msg.headers = { ""Authorization"": ""Token